ABSTRACT

In this thesis a new type of information retrieval system is 
suggested which utilizes data of the type generated by the users of the 
system instead of data generated by indexers. 

The theoretical model on which the system is based consists of
three basic elements. The first element is a measure of the related- 
ness between document-pairs. It is derived from information theory. 
The second element is a definition of what constitutes a set (cluster) 
of inter-related documents. This definition is based on the measure of 
relatedness. The last element is a procedure which transforms a request 
for information into a cluster of answer documents. 

Requests are made by designating one or more documents to be of 
interest and perhaps some to be of no interest. The requestor can
continue to interact with the procedure as it locates the answer cluster
by specifying as interesting or not interesting other documents which 
are presented to him. The answer cluster which is generated is auto- 
matically made as small (specific) or as large (general) as is desired, 
depending on the initial request and the subsequent interactions. 

An experimental system was developed to test the model in a 
realistic environment. It was programmed for the Project MAC time- 
sharing system and utilized the physics data file of the Technical 
Information Project. Citations were used as the data base for the 
measure of relatedness. A file structure and retrieval language were 
designed which allowed close man-machine coupling. 

Experiments were conducted which compared the clusters of docu- 
ments produced by the experimental system with various sets of documents 
of known mutual pertinence. These sets included bibliographies from 
review articles, subject categories, and sets of documents found to be 
of interest to selected users of the system. It was found that between
60 and 90% of the documents of known pertinence were included in the
corresponding clusters. Ways of improving this retrieval efficiency 
even further are suggested. 

Thesis Supervisor: Robert M. Fano 
Title: Ford Professor of Engineering 
PART ONE: INTRODUCTION

This thesis is divided into four parts. In 
this part we introduce the project by describing 
results of related work and by discussing the 
objectives of the research. In Part Two the 
theoretical model on which the project is based 
is presented. Part Three contains a description 
of the experimental system which was developed to 
test the model. In the final part we present the 
experimental results and the conclusions about the 
theoretical model that can be drawn from them. 
CHAPTER I
BACKGROUND 

1.1 Introduction 

In a pioneering article written at the close of World War II, Dr. 
Vannevar Bush, Director of the Office of Scientific Research and
Development, called on scientists to redirect their energies to creating
"a new relationship between thinking man and the sum of our knowledge." He
noted that "our methods of transmitting and reviewing the results of 
research are generations old and by now are totally inadequate." 

His challenge to mechanize and streamline the library process has 
been accepted by numerous groups in the intervening twenty years. A 
large number of devices have been developed which mechanically or 
electronically select information from a store. Methods of automatically 
indexing, classifying, and abstracting documents have been devised. A 
myriad of other disciplines have been called in for assistance. 

Before attempting to review and evaluate this activity, it is 
extremely important that the implied "inadequacies" of traditional 
library methods be clearly defined. Only then can one hope to deter- 
mine the effectiveness of any given approach in resolving these problems. 

1.2 Areas Needing Improvement

Six general aspects of library systems have been chosen as impor- 
tant areas which need improvement and which appear to be amenable to 
improvement through some type of mechanization. Most information
storage and retrieval projects have had as their stated or implied goals
one or more of these objectives. 

1.21 Closer Man-System Coupling 

In many cases a user who comes to an information system cannot 
state precisely what he wants. He has a very real need for information, 
but he cannot define exactly what that need is verbally. In other 
cases a user can accurately specify his interests but changes his mind 
as to what he wants when he finds that there are too many or too few 
articles which satisfy the request. 

Unfortunately most systems (automatic and manual) are designed for 
that rare individual who knows exactly what he wants and what the stack 
contains. In these systems there is a clear demarcation between request
specification by the user and answer presentation by the system. 

A much closer coupling of man and system is generally needed so 
that each can contribute to the best of his (its) ability at each step 
in the search. For example, the system might help the user in formulating 
the request by noting with each change in the request the probable number 
of documents in the final answer, by presenting representative documents 
for evaluation, and by ranking the output according to degree of related- 
ness. The user, on the other hand, could help the system find the desired 
answer by catching and correcting possible misunderstandings of the 
request as early in the search as possible, by narrowing or broadening 
the request if the size of the expected answer becomes too large or too 
small, and by continually refining the request based on the information 
supplied by the system. 
1.22 More Flexibility in Requests

Even if it is assumed that a user can adequately specify his
interests, there is still the difficulty of matching his request vocab- 
ulary with the vocabulary of the indexer. Perhaps the user is looking 
for books on "information retrieval" but fails to realize that the 
classifier posted such books under "documentation". Of course, the 
classifier may have foreseen this difficulty and placed a "see" card 
under information retrieval. However, this does not always occur. 

Another basic problem is faced by the person who knows a given 
paper or a given author of interest but is forced to translate this 
knowledge into a set of descriptors instead of being able to feed it 
in directly as a request. 

More flexibility is needed in the allowable vocabulary, language 
structure, and type of information which can be specified in a request. 

1.23 Physical Barriers 

The mere physical separation of the user from the library presents 
a barrier that has a greater impact than we may realize. This is also 
true of the separation of the card file from the stacks. Evidence of 
the importance of this factor is found in the popularity of small 
special collections distributed throughout a large organization and in 
the personal libraries maintained by most research workers. 

There is also the time barrier. If a person could get an answer to 
his problem in five minutes, he might be interested, whereas he might
decide to bypass the problem if it takes one-half hour or more. A
third barrier is cost. This factor is not a direct consideration to the 
user in most cases because no direct fee is levied for use of a library. 
1.24 Quality of Selection Information

All libraries provide the user with certain types of information 
which help him to select from the total store those books which are of 
interest to him without having to scan the text of each book. Even 
those libraries which cater to the browser generally arrange books by 
content on the shelves and place the spine out so that the title and 
author can be seen at a glance. 

There are at least three important factors which must be considered 
in the generation of selection information for a given document. 

1. The actual contents of the document. 

2. The collection in which the document will reside. 

3. The needs and characteristics of the user population
serviced by the collection.

If the only factor to be considered in indexing were the contents 
of the document, then a valid method for indexing would be to have each 
author, as the final authority on what the document contains, index it. 
However, libraries have found that the other two factors are also 
important and that an author cannot be expected to be familiar with 
each library and each user population that might have his book or 
article. 

The approach used by conventional libraries is to rely on an 
indexer or classifier to generate the selection information needed. 
This type of individual is usually an expert on the contents of the 
library collection, but knows much less about the first and third 
factors. He usually has about 10-15 minutes' time to determine what 
the author of the document has said and predict the types of users this 
information will be of interest to (through the categories selected); 
all this with little direct involvement in the field or area in question.
The amazing part about the whole process is that an indexer can some- 
times come up with a sketchy, but fairly useful portrayal of the docu- 
ment. 

An additional problem is that much of the literature (periodicals, 
technical reports, etc.) never even receives the attention of an indexer. 

1.25 Restrictive Classification Model 

Even if the classifier were able to determine the exact contents of 
a document, he would still find difficulty in fitting his findings into 
the rigid classification systems currently in use (Dewey Decimal, 
Library of Congress, etc.). 

First, the classifier is allowed only a yes-no type of response. 
Either the document is placed in a given category or it is not; there is
no middle ground, no partial relationship. 

Next there is the "broken relationship" problem inherent in hier- 
archal classification structures. No matter where a category is placed 
in the hierarchy tree, there are related fields to which it cannot be 
adjacent. For example, if the history of physics is placed in the 
science area, it loses its connection to history and vice-versa. This 
problem is only partially alleviated by the "see" and "see also" 
artifices. 

Third, there is the difficulty encountered in changing a classifica- 
tion structure to fit with our current body of knowledge. This involves 
considerable expansion and contraction of areas along with insertion of 
entirely new fields and the deletion of obsolete ones. The old classi- 
fication framework eventually becomes so strained in certain areas that 
there is danger of collapse.

Each of these difficulties encountered in the classification of
documents generates a corresponding difficulty for the user. V. Bush
described the use of a classification system in this way.

"...information is found (when it is) by tracing it down 
from subclass to subclass. It can be in only one place, 
unless duplicates are used; one has to have rules as to which 
path will locate it, and the rules are cumbersome. Having
found one item, moreover, one has to emerge from the system
and re-enter on a new path." 10

1.26 Need for Dynamic Indexing

Consideration of the problem of indexing leads one to the con- 
clusion that there is no intrinsic content to a document which, when 
once properly characterized by an appropriate set of words or phrases, 
is then adequately indexed for all situations and all users. In reality 
the depth and type of indexing needed depends both on the character- 
istics of the collection in which the document is imbedded and on the 
interests of the user population to be serviced by the collection at 
the time. 

Once this point is conceded then it becomes apparent that the way 
a document is indexed must change as the collection and user population 
vary. One of the major drawbacks of conventional indexing methods is 
that in practice they are static. A document, once indexed, is almost 
never re-indexed. Indeed some people believe that a properly indexed 
document should never need re-indexing. R. A. Fairthorne claims the
following:
"We have to assume that a classifier can decide that a
text is relevant to a topic in such a way that, apart from 
blunders, neither future development nor decisions elsewhere 
shall compel revision. Future developments certainly should 
not upset any decision about relevance; if an item is relevant 
to some topic, it will always be relevant, though the relevance 
may become unimportant and new relevancies may be added." 

The case for dynamic indexing was clearly presented by M. M.
Kessler:

"Indexing must be fluid and dynamic, reflecting the 
changing needs of society and the contributions of new insights. 
It is most unlikely that anybody, be he expert scientist or 
expert indexer, can read a given paper at a given time and see 
enough of its implications to classify it once and for all. If 
this philosophy of classification were accepted, as it now is, 
the resulting system would impose such a rigidity upon the flow 
of information that the working scientist would be forced to 
ignore it." 26 

1.3 Evaluation of Previous Efforts 

It would be impossible to describe all of the work which has been 
undertaken in the field of information retrieval and documentation in 
the last 20 years. What will be attempted here is an analysis of cer- 
tain representative efforts in each of six broad areas. 

1.31 Hardware Developments 

Many interesting machines have been developed for use in informa- 
tion processing (Rapid Selector, Peekaboo, Zator, Walnut, Minicard, 
general purpose computers, etc.). Instead of discussing the specific 
capabilities of these machines, let us note some of the general trends 
in hardware development which promise to have the greatest impact on 
information retrieval.

The first would be the development of multiply-accessed (time-sharing)
computers. A research worker with a connection to such a
computer would be able to query a large central store of information 
directly from his office, laboratory, or home and receive an almost 
immediate response. This is in contrast to the batch-processing com- 
puter which processes requests in groups at a central location and 
usually involves delays in response of from several hours to several 
days. A brief description of a particular time-sharing system (the one 
used by this research project) can be found in Sec. 6.1. 

A system of users interacting with a large central information 
store through a time-shared computer offers another important capability 
that might be overlooked. Not only can the user obtain information 
from the system, but the system can also monitor the user. This moni- 
tored usage data could be collected at little or no inconvenience to 
the user. It would complete the information loop with feedback from 
the user continually modifying and improving system performance. 

Another significant hardware advancement is the development of 
larger and larger mass memories. It is estimated that all of the textual
information in the 20 million documents in the Library of Congress
could be stored in a 10 trillion-bit (10^13) memory. Current random
access devices store 10^9 to 10^10 bits, while large magnetic tape
installations have a capacity of 10^? bits. Random access storage devices
have been announced in the 10^12 bit range. It would appear that continued
progress may soon eliminate storage capacity as a limiting factor in
the mechanization of large information retrieval systems.
A parameter closely related to memory size is access time.
Typical access times to any part of a 10^?-bit file on a random access
disc are currently 100 ms. The real problem is in knowing which part
of the file to read. Perhaps associative memories, complete file
inversion, or some other artifice will resolve this problem.



1.32 Indexing Methods and Models

As important as hardware developments are, V. Bush pointed out an
even more basic problem.

"The real heart of the matter of selection, however, 
goes deeper than a lag in the adoption of mechanisms by 
libraries, or a lack of development of devices for their 
use. Our ineptitude in getting at the record is largely 
caused by the artificiality of systems of indexing." 

The 'systems of indexing' to which Bush referred are, of course, 
the traditional subject catalog and classification schemes still in use 
(Universal Decimal, Library of Congress, etc.). Some of the drawbacks 
of these classification systems were discussed in Section 1.25. 

Beginning about 1950 efforts were made to replace these conventional
classification methods. One result was "coordinate indexing." In
coordinate indexing documents are assigned Uniterms or descriptors 
(usually single words). These descriptors are given no hierarchal or 
other structure. A request consists of certain descriptors connected 
by the logical and-or-not operations. 

Coordinate indexing eliminated many of the difficulties encountered 
in hierarchal classifications and subject catalogs. However, its 
strength was also its shortcoming. The elimination of all order and 
structure from the descriptors introduced many 'false drops'. For 
example, a hypothetical user looking for papers on the causes of blindness
in Venice might also retrieve articles on the design of Venetian
blinds. To reintroduce that which was lost by eliminating descriptor 
context and order, such features as role indicators were used. 
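
The and-or-not request logic and the resulting false drop can be illustrated with a short sketch. Everything in it is assumed for the purpose of illustration: the three document names, the reduction of "blindness" and "blinds" to a shared descriptor `blind` (and of "Venice" and "Venetian" to `venice`), and the `search` function itself are hypothetical and are not taken from any system discussed here.

```python
# Hypothetical coordinate index: each document carries an unordered set of
# single-word descriptors, with no hierarchy, context, or word order.
index = {
    "causes_of_blindness_in_venice": {"blind", "venice", "cause"},
    "design_of_venetian_blinds": {"blind", "venice", "design"},
    "history_of_venice": {"venice", "history"},
}


def search(index, all_of=(), any_of=(), none_of=()):
    """Return the ids of documents satisfying an and-or-not request."""
    hits = []
    for doc_id, terms in index.items():
        if not set(all_of) <= terms:            # AND: every term must be present
            continue
        if any_of and not set(any_of) & terms:  # OR: at least one term present
            continue
        if set(none_of) & terms:                # NOT: no excluded term present
            continue
        hits.append(doc_id)
    return sorted(hits)


# The request "blind AND venice" also retrieves the Venetian-blind paper.
false_drop_result = search(index, all_of=["blind", "venice"])

# Adding "NOT design" happens to remove the false drop in this toy file,
# but only because the user guessed which descriptor to exclude.
narrowed_result = search(index, all_of=["blind", "venice"], none_of=["design"])
```

In this toy file the first request returns both the blindness paper and the Venetian-blind paper, which is the kind of false drop that role indicators were meant to suppress.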

Currently some workers in the field seem to be disenchanted with 
coordinate indexing and have shifted reluctantly back to the conventional 
classification methods. 

Another field of endeavor was in the modeling area. A number of 
models were proposed which described the indexing and retrieval functions.
Unfortunately that was all that these models did - they provided an
alternate way of describing an already familiar problem. No new insights
were gained and no helpful procedures resulted. 

1.33 New Bases for Selection Information

It has already been noted that all library systems depend on 
selection information (classification categories, subject headings, 
author indexes, etc.) to locate documents relevant to a particular
request. Customary library practice is to depend on the indexer to
produce this information. Section 1.24 outlines some of the difficulties
inherent in this dependence.

Studies during the past eight years have been undertaken to see if 
selection information generated by indexers can be supplemented and per- 
haps replaced by that generated by the automatic processing of a docu- 
ment's contents. 

At first simple methods of exploiting the information found in a
document were tried. Permuted title indexes and citation indexes met
with some success. In 1958 Luhn proposed automatic abstracting. 31
This consisted of the selection of certain words as the keywords of a
document based on their frequencies of occurrence. The sentences and/ 
or phrases which contained these words were then extracted to form the 
auto-abstract of the document. The idea was then extended by Maron in
1961 to the automatic indexing of documents with the keywords extracted
becoming the descriptors.
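
The frequency-based procedure just described can be sketched in a few lines. This is a simplified reading of the description above and not a reproduction of Luhn's or Maron's published procedures; the stop-word list, the sample text, and the function names `keywords` and `auto_abstract` are assumptions introduced only for illustration.

```python
import re
from collections import Counter

# Assumed stop-word list; a real system would use a much longer one.
STOPWORDS = {"a", "an", "and", "are", "is", "of", "that", "the", "was"}


def keywords(text, top_n=3):
    """Pick the top_n most frequent non-stop words as the keywords."""
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]
    return [word for word, _ in Counter(words).most_common(top_n)]


def auto_abstract(text, top_n=3, max_sentences=2):
    """Extract the sentences containing the most keyword occurrences."""
    keys = set(keywords(text, top_n))
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    scored = []
    for position, sentence in enumerate(sentences):
        score = sum(1 for w in re.findall(r"[a-z]+", sentence.lower()) if w in keys)
        scored.append((score, position, sentence))
    # Keep the highest-scoring sentences, then restore document order.
    best = sorted(scored, key=lambda t: (-t[0], t[1]))[:max_sentences]
    return [sentence for _, _, sentence in sorted(best, key=lambda t: t[1])]


sample = (
    "Citation indexes link documents. "
    "Documents that share citation lists are related. "
    "The weather was pleasant."
)
sample_keywords = keywords(sample, top_n=2)
sample_abstract = auto_abstract(sample, top_n=2, max_sentences=2)
```

Under the automatic-indexing extension, the list returned by `keywords` would serve directly as the document's descriptors.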

Automatic indexing was about 50% successful in assigning documents
to the same categories that the human indexer did. This mediocre 
showing can be attributed to the fact that machine indexing did not 
make use of the order, context, syntax and synonyms of the words 
extracted. This in essence is the same difficulty found in coordinate 
indexing. Some of the subsequent efforts at automatic indexing 
attempted to account for syntax, but this trail encountered the same 
massive obstacles that had already slowed progress in automatic language 
translation. 

Thus after some initial success, the automatic generation of 
selection information based on document contents ran aground. One 
cannot dispute the fact that a description of the subject covered by 
the article is contained within the article. Just how one can capitalize 
on that knowledge is the problem. The needed information is there, but 
machines and indexers currently can extract only a part of it. 

There is one notable exception to the above comments. The 
citations found in articles do not have the same type of synonym and 
syntax problems that textual material does. Thus selection information 
generated from citations has had considerable success for those bodies 
of literature which have a good citation base. 

A discussion of the user of a library as a source of selection 
information will be postponed until Chapter II, since little, if any,
prior experimental work has been done in this area. 

1.34 Measures of Relevance

In conventional library systems documents are assigned to 
categories and subject headings on a yes-no sort of basis. Either the 
document is in the category or it is not; there is no middle ground.
The restrictive nature of this type of arrangement was pointed out by 
Maron and Kuhns in 1960. They proposed that an 8-value weighted
indexing scheme be used to represent the degree to which a document is 
related to a term. 

This idea was extended to thesauri by Stiles in 1961. A tradi- 
tional thesaurus allows terms to be listed as synonyms or antonyms but 
the degree of synonymity is left unspecified. Stiles proposed an 
association factor to represent the amount of synonymity between terms. 

Numerous other 'measures of relevance' between the various 
entities of libraries have been proposed since. Some of the better 
known of these measures are tabulated in Appendix A. Unfortunately, 
there appears to be considerable confusion over exactly what these 
measures represent, and the use of the term 'relevance' would seem to 
add to this confusion. 

Many documentalists now speak with some assurance about the amount 
(to 3 or 4 significant figures) of 'relevance' of a document to a
category or to a request. The 'relevance ratio' is an accepted way to
measure information retrieval system efficiency. All too often these 
comments leave one with the impression that there is some intrinsic 
meaning to a word or document which has now been quantitatively described, 
when in reality all that has been accomplished is the invention of some
type of frequency ratio.

In traditional library work confusion also appears to exist. Indeed
the very idea of classification implies to some that there is some
inherent content of a document which must be indexed. The already quoted
comment by R. A. Fairthorne can be cited as an expression of the
attitude of some classifiers.

"Future developments certainly should not upset any 
decision about relevance; if an item is relevant to some
topic, it will always be relevant, though the relevance may
become unimportant and new relevancies may be added." 17

Let us suggest that the intrinsic meaning or concept behind a word 
is a philosophical problem and cannot be dealt with operationally. 
Those aspects of a document which do not influence its environment (i.e. 
the library and the user) are of no practical significance because they 
cannot be observed, measured, or even proved to exist. 

To avoid adding further to this misunderstanding we shall avoid the 
use of the word 'relevance' in the rest of this paper. The frequency 
ratios used by this project will be termed 'measures of relatedness'.
It is hoped that this term is less loaded with connotations of intrinsic 
meaning. 

1.35 Automatic Classification and Clumping Experiments 

After automatic indexing was proposed for the assignment of docu- 
ments to categories, it was only natural that the automatic determina- 
tion of the categories themselves should be tried also. This was done 
initially by borrowing two techniques from mathematical psychology — 
factor analysis and latent class analysis. Factor analysis is used to 
discover the underlying factors which account for the performance of a
group of people on a battery of tests. Latent class analysis is a
procedure used to divide a group of people into disjoint sub-groups on 
the basis of their responses to a questionnaire. 

Latent class analysis for information retrieval has not yet been
experimentally tested. 1,52 Borko's work with factor analysis was based
on the occurrence of keywords in document abstracts. A correlation
matrix of keywords versus keywords was formed and was factor analyzed,
resulting in categories which had some resemblance to those manually
selected for the same corpus.

An even earlier attempt at automatic classification was tried by
Needham and Parker-Rhodes in England. They called it clumping
and produced a heuristic procedure which selected clumps of documents
from a file. Their work has been extended in this country by Dale 13
and also by Bonner.

Since clumping is the most closely related endeavor to the object- 
ives of this project of any to date, a slightly more extended description 
of the results will be given. A library collection is thought of as a 
network with the nodes representing documents and values assigned to 
the links (usually 0 or 1 only). This collection is partitioned into
two subsets, A and B. The sum of the links internal to A is denoted by 
AA and the sum of the links internal to B is denoted by BB. The only 
other links in the network are those which cross from set A to set B. 
The sum of these links is designated AB. 

A GR clump is defined as any set A which produces a local minimum
of the function F(A). 13

$$F(A) = \frac{AB}{AA + BB}$$
A more recent type of clump, the D clump, is defined as any set A
which produces a local minimum of the function G(A). 12

$$G(A) = \frac{AB}{\sqrt{(AA)(BB)}}$$

GR clumps are fairly easy to locate. Some additional restrictions 
must be placed on D clumps to make the definition useful since local 
minima of G(A) occur for quite unrelated sets of documents. The latest
effort has been to find an initial set of items by some other method and 
then use the D-clump method to complete the set. 
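
To make the two clump criteria concrete, the sketch below computes the link sums AA, BB, and AB for a candidate set A and evaluates both functions. It takes F(A) = AB / (AA + BB) and G(A) = AB / sqrt(AA * BB) as the reading of the two definitions above, which is an assumption about formulas that are damaged in this copy. The six-document network, the function names, and the convention of returning infinity when a denominator vanishes are likewise hypothetical choices made for illustration, and the sketch only evaluates the functions; it does not search for local minima.

```python
import math


def link_sums(links, subset_a):
    """Return (AA, BB, AB) for the partition of the network into A and the rest."""
    a = set(subset_a)
    aa = bb = ab = 0
    for (i, j), value in links.items():
        inside = (i in a) + (j in a)
        if inside == 2:
            aa += value      # link internal to A
        elif inside == 0:
            bb += value      # link internal to B
        else:
            ab += value      # link crossing between A and B
    return aa, bb, ab


def f_value(links, subset_a):
    """Assumed GR-clump function F(A) = AB / (AA + BB)."""
    aa, bb, ab = link_sums(links, subset_a)
    return ab / (aa + bb) if aa + bb else math.inf


def g_value(links, subset_a):
    """Assumed D-clump function G(A) = AB / sqrt(AA * BB)."""
    aa, bb, ab = link_sums(links, subset_a)
    return ab / math.sqrt(aa * bb) if aa * bb else math.inf


# Hypothetical network: two fully linked triples joined by the single link 3-4.
links = {
    (1, 2): 1, (1, 3): 1, (2, 3): 1,
    (4, 5): 1, (4, 6): 1, (5, 6): 1,
    (3, 4): 1,
}

natural_split = (f_value(links, {1, 2, 3}), g_value(links, {1, 2, 3}))
poor_split = (f_value(links, {1, 2}), g_value(links, {1, 2}))
```

For this network the split {1, 2, 3} gives smaller values of both functions than the split {1, 2}, which matches the intuition that a clump has few links crossing its boundary relative to the links inside each part.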

Both the automatic classification and the clumping experiments are 
designed so that all of the classifying and indexing would be completed 
before the requests are processed. 

1.36 Systems Evaluation 

The most widely accepted method of evaluating the performance of 
information retrieval systems is currently through the recall and 
relevance ratios. The recall ratio is the percentage of relevant 
items that are actually retrieved and the relevance ratio is the percent- 
age of retrieved items that are relevant. 
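
The two ratios can be written out directly from these definitions. In the sketch below the document identifiers and function names are hypothetical, the ratios are returned as fractions rather than percentages, and returning zero for an empty set is an assumed convention.

```python
def recall_ratio(retrieved, relevant):
    """Fraction of the relevant items that were actually retrieved."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(set(retrieved) & relevant) / len(relevant)


def relevance_ratio(retrieved, relevant):
    """Fraction of the retrieved items that are relevant."""
    retrieved = set(retrieved)
    if not retrieved:
        return 0.0
    return len(retrieved & set(relevant)) / len(retrieved)


# Hypothetical search: five relevant documents exist, four were retrieved,
# and three of the retrieved ones are relevant.
relevant_docs = {"d1", "d2", "d3", "d4", "d5"}
retrieved_docs = {"d1", "d2", "d3", "d9"}

recall = recall_ratio(retrieved_docs, relevant_docs)
relevance = relevance_ratio(retrieved_docs, relevant_docs)
```

Both ratios depend on the set labelled relevant, so whoever supplies that set determines the measured performance of the system.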

In determining what is or is not relevant, recourse is usually 
made to an indexer or a user. Recent studies have shown that these 
people are able to agree among themselves as to how documents should be 
classified in at most 80% of the cases. This "failure" of humans to
index consistently has led some to try to find better automatic
"non-judgemental" standards on which to validate relevance.

If the primary objective of a library is in serving a given user 
population, then it is difficult to imagine that there could be any 
criteria for relevance other than one based on those users. If, on the
other hand, the function of a library is to set up a universal classi- 
fication system, then the user should certainly be eliminated as the 
standard on which system efficiency is evaluated. 

The idea that the users of a system can "fail" in classifying a 
document implies an intrinsic content in documents which one or more of 
the users has not recognized. A more practical outlook in keeping with 
the arguments of Sec. 1.34 is that these differences in indexing are
only the normal result of individual backgrounds and interests. 
CHAPTER II
OBJECTIVE OF THIS PROJECT 

2.1 Brief Description of Project Objective 

Let us assume for a moment that we wish to design an information 
storage and retrieval system which is based on feedback from users. In 
this system each request for information is to consist of a set of one 
or more documents that the user has already found to be of interest and 
a second (possibly empty) set of documents that he knows are not of
interest. 

The purpose of each interaction of a user with the system is to 
transform a request of this type into a partitioning of the total collec- 
tion into two disjoint subsets — one containing all documents that are of 
interest to the user and the other containing those not of interest (the 
rest of the stack). This process is to be accomplished jointly by the 
user and the system. 

The feedback which the system stores for use in answering future 
requests is to consist of these file partitionings. A measure of the
relatedness between any two documents based on their usage and co-usage 
patterns as found in the partitionings is to be utilized to facilitate 
the request-to-answer transformation. 

The document collection of such a system can be thought of as a 
network where each node represents a document and each link is given a 
value corresponding to the measure of relatedness between the two
linked documents. 
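
As a rough illustration of how stored partitionings could supply the raw material for such a network, the sketch below tallies usage and co-usage counts from a few "of interest" sets. The sample partitionings and the function name `usage_counts` are hypothetical, and the raw co-usage count stands in only as a placeholder for the link value: the measure of relatedness actually used in this project is defined later and is not reproduced here.

```python
from collections import Counter
from itertools import combinations


def usage_counts(partitionings):
    """Tally how often each document, and each document pair, is of interest.

    Each partitioning is given as the set of documents found to be of
    interest; the remainder of the file is implicitly "not of interest".
    """
    usage = Counter()
    co_usage = Counter()
    for interesting in partitionings:
        docs = sorted(set(interesting))
        usage.update(docs)
        # Every unordered pair within one answer set counts as one co-usage.
        co_usage.update(combinations(docs, 2))
    return usage, co_usage


# Hypothetical stored feedback from three past requests.
partitionings = [
    {"a", "b", "c"},
    {"a", "b"},
    {"c", "d"},
]
usage, co_usage = usage_counts(partitionings)
```

Each key of `co_usage` is a link of the network, so only document pairs that have appeared together in at least one answer set need to be stored.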
The objective of this research endeavor is to devise, test, and
evaluate a procedure which will perform the transformation of request 
to answer partition for this type of retrieval system. 

In the above discussion we suggested for purposes of illustration 
a retrieval system based on file partitionings which are generated by 
the users of the system. Partitioning information of this sort would 
not be available for documents that have just been added to a file. 
Indeed, such information is not readily available for any file of docu- 
ments at the present time. 

There are, however, some types of partitionings which are available. 
Take, for example, the citations in an article. The author of an article 
selects for citation certain documents that he feels are pertinent to 
the article he has written. In a sense he is a special type of user of 
the library and has created a meaningful partition of the file. Other 
types of partitionings of the file could also be suggested. 

Usage information was selected for discussion here because it is 
an interesting and representative example of the larger class of parti- 
tioning information for which we propose to design a retrieval system. 
In the remainder of this chapter and in the next chapter we will, 
therefore, continue to talk in terms of the partitionings generated by 
users. It should be understood, however, that the type of retrieval 
system to be developed need not be restricted to this single type of 
partitioning data. 

In the next section we will present some arguments for and 
against information retrieval based on usage information. We will then 
discuss how usage information can best be represented and utilized.

2.2 Value of Usage Information

In the article already cited at the beginning of Chapter I, V. 
Bush suggested that an individual's personal information storage and 
selection system could be based on direct connections between documents 
instead of the usual connections between index terms and documents. 
These direct connections were to be stored in the form of trails through 
the literature. Then at any future time the individual himself or one 
of his friends could retrace this trail from document to document with- 
out the necessity of describing each document with a set of descriptors 
or tracing it down through a classification tree. 

In 1956 R. M. Fano suggested that a similar approach might prove
useful to a general library. He proposed that "the concomitant use of 
documents by experts as evidenced by library records, and other similar 
joint events" might be a useful basis for document retrieval. 1 ^' 1 ^ His 
proposal evoked a number of adverse comments, two of which will be quoted 
here. 

2.21 Objections

A theoretical objection to basing retrieval on usage was raised by 
Y. Bar-Hillel. 

"A colleague of mine, a well-known expert on 
information theory, proposed recently, as a useful tool for 
literature search, the compiling of pair-lists of documents 
that are requested together by users of libraries. He even 
suggested, if I understood him rightly, that the frequency 
of such co-requests might conceivably serve as an indicator 
of the degree of relatedness of the topics treated in these 
documents.

"I believe that this proposal should be treated
with the greatest reserve. Although much less ambitious
than Taube's proposal of an association dictionary, it is in
many respects strikingly analogous to it and shares its
shortcomings. The fact that a co-requestedness chain of documents
can be easily followed up by a machine is not in itself a
sufficient reason for making the assumption that this relation
might be a useful approximation to the important relation of
dealing-with-related-topics between documents. And one can
think of many other easily establishable relationships between
documents that stand a better chance of being a useful
approximation, e.g. co-occurrence of their references in reference
lists printed at the end of many documents, co-quotation, and
so on."²



The shortcoming of 'Taube's proposal' referred to in this quote is
the familiar triangle argument.

"Knowing that 'a' and 'b' co-occur... and that 'b' and 'c'
co-occur...what do we know about the connection between the
'ideas' 'a' and 'c'? Clearly, nothing definite whatsoever..."

What Bar-Hillel says is true also of hierarchical classification
systems where the adjacency of categories a and b and of categories b 
and c proves nothing about the relationship of a and c. It is true of 
any system consisting of a set of items and characteristics that cannot 
be described by some type of metric space. 

On the other hand the fact that documents a and c are not related 
in every case when linked through a third document b is more of a hypo- 
thetical objection than a practical one. If, in fact, items with the 
a-c type connection are found to be related on the average much more 
frequently than items chosen at random, then the usefulness of this type 
of connection in document selection should not be overlooked. 

A second objection to Fano's suggestion was raised by C. N. Mooers.
It is a practical rather than a theoretical objection.

"To provide feedback for improving machine performance
Fano and others have suggested the use of statistics of the 
way which people use the library collection. Though the 
suggestion points in the right direction, I think this kind 
of feedback would be a rather erratic source of information 
on equivalence classes, because people might borrow books on 
Jack London and Albert Einstein at the same time. Although 
this difficulty can be overcome, there is a more severe problem. 
Any computation of the number of people entering a library and 
the books borrowed per day, compared with the size of the 
collection shows, I think, that the rate of accumulation of 
such feedback information would be too slow for the library 
machine to catch up to and get ahead of an expanding technology." 

Mooers' objection assumes that the capability of accepting feedback 
from the user is to be superimposed on a conventional library structure 
and that it will have little net effect on the frequency of use of that 
library. Let us accept these assumptions for the moment and suggest 
some reasons why usage information would still prove profitable. 

First, libraries might well find it helpful to share usage patterns 
and thereby increase the total information available to any one library. 
Second, the well used documents will have plenty of usage statistics and 
be well 'indexed', while unused books will have no statistics—a seem- 
ingly equitable arrangement. Third, even the information on one usage 
of a document may prove more valuable than the information supplied by 
the indexer of that document. Fourth, usage information is not pur- 
ported to be a cure-all which will replace all of the current types of 
selection information. It is felt to be a supplemental source of 
selection clues which should grow in importance as more user feedback is 
collected.

Now let us return to the initial assumptions and note that the
number of people who enter a library is by no means an indication of 
the amount of time spent in the study of printed material. It is merely 
an indictment of current library practices. If, in fact, information 
were made available to research workers right in their offices through 
the type of computer time-sharing system described in Section 1.31, then
the amount of feedback available from users should radically change. 

2.22 Supporting Arguments 

Thus far in this section we have cited two early proposals that 
document selection be based on user feedback. We have quoted both a 
theoretical and a practical objection to such an approach and have 
attempted to answer these objections. Let us now turn to some of the 
positive arguments favoring user feedback which, to this author at least,
are compelling reasons why document retrieval should be based on infor- 
mation from the user. 

The first argument has already been alluded to in Section 1.26.
In this section the need for dynamic indexing was observed. It was 
noted that it is impossible for an indexer to foresee all of the possible 
applications of a paper at any given point in that paper's history and 
especially not just after it is written. 

To account for the changing relationships and new applications of 
papers in a collection, a library must be supplied with information. 
Such information regarding the changing nature of the corpus must come 
from the three participants in the library process: author, indexer,
and user.

To require indexers to periodically re-index the collection would
be financially impossible. Many libraries find it difficult to even 
initially index each incoming document. 

The textual information placed in the document by the authors
offers little help either. Take, for example, a research worker who
publishes a new discovery. A terminology which eventually evolves to
describe that discovery may be markedly different from the language of
the initial paper. And it would be a rather momentous task to develop
a thesaurus which could connect the groping language of the basic paper
with the codified terminology which eventually results.

Thus, the user is left as the one participant in the library 
system who is continually interacting with the collection and could 
introduce dynamic indexing into the system. 

Let us note at this point that citation information in newly added 
documents represents a specialized type of user information (the author 
acting as a user of the old file), and as such can act in the same way 
as usage information to give the system a changing indexing structure. 
Some other advantages of this source of indexing information were noted 
in Sec. 1.33. 

The second argument in support of the utilization of user feedback 
concerns the quality of the indexing which results thereby. The advant- 
age of having the indexing done by people actually immersed in a given 
research area can hardly be overemphasized. Hitherto neglected refine- 
ments and distinctions can be made, the structure of the field as the 
actual worker sees it can be established, and many unintentional 
blunders can be avoided.

It should be noted that the quality of indexing by usage is a
controllable parameter. Take, for example, the users of articles in
the Physical Review. This group of people represents a highly
knowledgeable and motivated segment of the population which should be able
to form valid links between documents. If, however, the quality of the
resulting indexing is still insufficient, the system could be designed
to accept feedback from only a segment of the population, say the faculty
but not the students. This could even be made a parameter specifiable 
by the user so that he could use the feedback from that segment of the 
population which most closely fitted his own background. 

A third reason for indexing by user feedback is that it may be 
possible to do it as a by-product of normal library use and thus avoid, 
to some extent, the high cost of indexing which currently burdens a 
library. 

2.23 Collecting Usage Information 

Let us now discuss the problem of how the intellectual decisions 
needed from the user can best be obtained. The sets of citations found 
in articles form one readily available source of sets of documents that 
have been judged mutually pertinent. The data used by the experimental 
portion of this project was taken from this source. (See Sec. 6.22) 

Let us consider for a moment whether a retrieval system could be 
designed which was based on usage data of the type described in Sec. 2.1. 
One major difficulty would be to devise some way of encouraging the 
user to supply the system with the data needed. Some possible ways 
this might be accomplished are the following:

1. The user finds that the system automatically disseminates to
   him new articles of interest if he has provided profiles of
   his interests in the form of sets of papers of known interest.

2. The user finds that in interacting with the retrieval program
   he converges on papers of interest more rapidly if he tells
   the system whether each paper presented is of interest or not.

3. The user contributes sets of related papers to the system
   because he wishes to improve its usefulness to himself and
   others.

4. Certain users are provided monetary remuneration for supplying
   the system with sets of related documents.

2.3 The Purpose of Measures of Relatedness 

The next question that arises after one has accepted the idea that 
information selection might appropriately be based on some type of usage 
data concerns the form that this data should be expressed in. One 
might propose that each usage set be treated the same way as a subject 
heading or descriptor set with its label being the name of the user 
that generated the set. Under this scheme one might retrieve all of the 
papers of interest to a given user or all of the papers which have been
found of mutual interest with a selected paper. Indeed the ability to 
answer these types of questions is a valid capability to equip a 
retrieval system with. 

However, there are some significant differences between the sets of 
papers generated by users and the sets of papers generated by some type 
of indexing scheme. First, there is the fact that any given paper occurs 
in, at most, only a handful of indexing categories, while it might
possibly occur in a very large number of user sets. Second, there can
be any number of user sets centering around a given area of research, 
but this area would be normally covered by only one subject category. 
Third, usage sets would be continually added to the system, but new 
categories would be added infrequently. 

All this adds up to the fact that users who attempt to extract 
information from usage files with normal matching techniques will 
probably be overwhelmed with the nonuniform, massive, fluctuating
nature of this type of data. 

Some type of statistical measure is needed which will combine and
summarize the results of many user interactions. The specific
characteristics which this measure should have are discussed in
Chapter III.

PART TWO: THEORETICAL DEVELOPMENT

The three chapters of this part describe the theoretical 
model on which the research project is based. There are three 
closely related components of the model. 

Chapter III: Measure of Relatedness 
Chapter IV: Cluster Definition 
Chapter V: Search Procedure

The experimental system which was devised to test the 
applicability of the model to a real world situation will be 
described in Part Three. It is hoped that this organization 
will help in keeping the abstract ideas of the model separate 
from the particular physical implementation which was developed 
to test them. It may be somewhat misleading, however. In 
actuality the model was not completely developed before the 
implementation began. It was continually revised and improved 
as various versions of experimental systems were programmed, 
tested and then discarded. What is described in this and the 
next part is the current model and test program.

CHAPTER III
MEASURE OF RELATEDNESS 

The first step in establishing the conceptual basis of the research 
project is the selection of a measure of the relatedness between docu- 
ments. To this end a sample space will be defined and a probability 
distribution assigned to it. Then a measure based on these probabil- 
ities will be selected and some of its characteristics noted. Finally 
the document network generated by the measure will be described. 

3.1 Sample Space 

In order to motivate the choice of our mathematical model, we 
regard each interaction of a user with a library as a partitioning of 
the stack into two disjoint subsets of documents: one containing all 
the documents of interest to the user and the other containing the rest 
of the documents. Each interaction is assumed to have a single purpose 
in the sense that all documents of interest are of interest for the 
same purpose. 

There are theoretically $2^n$ such partitionings possible for a stack
of n documents. Now let us think of a discrete collection of $2^n$ points
(a sample space²²), each representing one of the possible partitionings.
These points can be identified by n-bit binary numbers, $x_1 \ldots x_n$,
where $x_i$ is 1 if the $i$th document is in the subset of interest and 0
if it is in the subset of no interest for the partition in question.
(A superscript will be used to denote the value of a variable:
$x_i^1$ denotes $x_i = 1$.)

For a given user population and document collection a probability
distribution $p(x_1 \ldots x_n)$ can be assigned to the sample space. Each
$p(x_1 \ldots x_n)$ may be regarded as the probability that a user chosen at
random from the population will partition the document collection with
the partition $x_1 \ldots x_n$.

Compound events can be defined in terms of the simple events
represented by the sample points. For example, $p(x_1^1)$, the probability
that document 1 will be of interest to some user, can be obtained by
summing the probabilities of all points for which $x_1 = 1$.

$$p(x_1^1) = \sum_{x_2 \ldots x_n} p(x_1^1 x_2 \ldots x_n)$$

Similarly $p(x_1^1 x_2^1)$, the probability that documents 1 and 2 will be
found to be of interest jointly, can be obtained by summing up the
probabilities of all points for which $x_1 = 1$ and $x_2 = 1$.

$$p(x_1^1 x_2^1) = \sum_{x_3 \ldots x_n} p(x_1^1 x_2^1 x_3 \ldots x_n)$$
In the sections that follow we will want to talk not only about
the abstract theoretical values of these probabilities, but also about
their estimated values as obtained from experimental data. Suppose that
there is information available on a large number of partitionings of a
library. Let us make the following definitions.

$N$: Total number of partitionings of the library that are
available.

$N_i$: Number of partitionings in which document i occurs in the
subset of interest.

$N_{ij}$: Number of partitionings in which both documents i and j
occur in the subset of interest.

Based on these N's estimates of the probabilities can be made as
follows:

$$p(x_i^1) \approx \frac{N_i}{N}$$

$$p(x_i^1 x_j^1) \approx \frac{N_{ij}}{N}$$

etc.
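
The counting behind these estimates can be sketched in a few lines. The sketch below assumes, purely for illustration, that each partitioning is recorded as the set of documents found to be of interest; the function name `count_occurrences` and the sample data are hypothetical and are not part of the experimental system.

```python
from collections import Counter
from itertools import combinations


def count_occurrences(partitionings):
    """Return (N, N_i, N_ij) for a list of subsets of interest.

    Each partitioning is given as the set of documents of interest;
    every other document in the stack is implicitly of no interest.
    """
    n_total = len(partitionings)
    n_single = Counter()  # N_i: partitionings containing document i
    n_pair = Counter()    # N_ij: partitionings containing both i and j
    for subset in partitionings:
        docs = sorted(set(subset))
        n_single.update(docs)
        # Pair keys are stored in sorted order, so ("a", "b") but not ("b", "a").
        n_pair.update(combinations(docs, 2))
    return n_total, n_single, n_pair


# Hypothetical data: four partitionings of a stack containing a, b, c, d.
partitionings = [{"a", "b"}, {"a", "b", "c"}, {"c"}, {"a", "d"}]
N, N_i, N_ij = count_occurrences(partitionings)

p_a = N_i["a"] / N           # estimate of p(x_a = 1)
p_ab = N_ij[("a", "b")] / N  # estimate of p(x_a = 1, x_b = 1)
print(p_a, p_ab)             # 0.75 0.5
```

Only the univariate and bivariate counts are kept here, since these are the only quantities the estimates above require.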

The partitioning data employed in these estimates may result from 
experimental evidence other than actual user interactions with the stack 
of documents in question. For instance, one might partition the stack 
on the basis of whether or not the documents cite a given document, or 
on the basis of whether or not they contain a particular word in their 
titles. As a matter of fact, the experimental system described in 
Chapter VI uses partitionings based on whether or not the documents cite
a given document because these were readily available while actual usage 
data were not. 

This use of another type of partitioning data (other than usage 
data) by the experimental system is considered acceptable here since 
the purpose of the experimental portion of the project is to permit an 
investigation of general properties of the theoretical model that should 
be largely independent of the precise values of the probability esti- 
mates. 

3.2 Criteria for Selecting a Measure of Relatedness 

We have already noted in Sec. 1.34 that a number of measures of
'relevance' have been suggested for use in information retrieval. Some
of the more widely known of these measures are tabulated in Appendix A. 
The differences between them are partially due to the fact that they 
were designed for different purposes and partially due to the varied
backgrounds of the people who proposed them. Some of them have a
theoretical basis in probability, statistics, or information theory; others
are of an ad hoc nature. 

In Sec. 2.3 we discussed why a measure of relatedness was needed 
for this project. The purpose of such a measure is not to rate the 
individual or joint merit of the documents in the stack, but rather to 
represent their relationship in terms of frequency of use and co-use. 
To this end it was decided that the measure selected should have the 
seven characteristics listed below. 

Not all of the measures of Appendix A are expressible in terms of
the theoretical probabilities of the last section. Therefore, for
purposes of comparison we shall express these seven criteria in terms of
the frequency counts on which the estimated probabilities are based.
The N's are as defined in the last section, $C_{ij}$ is the measure of
relatedness between documents i and j, and $R \propto S\,|_T$ means that R
monotonically increases with S as T is held constant.

1. Co-occurrence Factor $C_{ij} \propto N_{ij}\,|_{N, N_i, N_j}$

The measure should monotonically increase with the number of
co-occurrences in the subset of interest of the documents in question if
all other factors are held constant. Consider, for example, a pair of
documents (i,j) and another pair (r,s). If the N's are the same for
both pairs except that $N_{ij} > N_{rs}$, then the relatedness between i and j
should be greater than the relatedness between r and s.

2. Other Usage Penalty Factor $C_{ij} \propto 1/N_i\,|_{N, N_j, N_{ij}}$

The measure should monotonically decrease as the number of
occurrences of one of the documents increases, all other factors being
held constant. That is, if document i is used a larger number of times
but not in conjunction with document j, then the relatedness between i
and j should decrease.



3. Co-occurrence Ratio Factor $C_{ij} \propto N_{ij}/N_i\,|_{N, N_j}$

If the ratio or fraction of the number of co-occurrences of
document i with document j to the total occurrences of document i
increases, the measure should increase also. Note that this criterion is
not a consequence of 1 and 2.

4. Function of Probability Estimates Only $C_{ij} = C(N_i/N, N_j/N, N_{ij}/N)$

The measure should depend only on the ratios of frequency 
counts which are used to estimate the probabilities. As long as these 
ratios remain constant the measure should not change. 

5. Statistical Independence 

The one benchmark that is available for measures is the
statistical independence of the events in question. It would seem
logical that if the occurrences of two documents are statistically
independent, their measure of relatedness should have the value 0.

6. Theoretical Basis 

A measure that has a solid theoretical basis is to be
preferred over one which has been developed by trial and error.

7. Ease of Use 

The best measure is a simple one that is easy to calculate 
and manipulate. 

3.3 Selection of a Measure 

Let us now evaluate the measures of Appendix A in terms of the
criteria of the last section. Measures (1) and (2) have no theoretical
basis (Criterion 6) and are not 0 for statistically independent events
(Criterion 5). The Chi Square Formula (5) is not expressible in terms
of the probability estimates (Criterion 4). The value of the Cosine
Formula (6) for statistically independent events is
$\sqrt{p(x_i^1 x_j^1)}$, which is neither 0 nor even constant. The Average
Correlation Coefficient (7) does not satisfy Criteria 1, 2, or 3.

This leaves Measures 3, 4, and 8 which meet (at least partially) all
of the criteria listed. Measure 8 was selected for this research
project because its foundation in information theory has led to some very
interesting and useful results.

The use of Measure (8) in document retrieval was first proposed by
R. M. Fano¹⁹. In its more general form it expresses the degree to which
a set of events $x_1^1, \ldots, x_r^1$ are correlated in terms of their
individual and joint probabilities.

$$C(x_1^1 \ldots x_r^1) = \log \frac{p(x_1^1 \ldots x_r^1)}{p(x_1^1) \cdots p(x_r^1)} \qquad (1)$$

The base of the logarithm function used in the formula and through- 
out the remainder of this paper will be assumed to be 2. This will mean 
that the unit of correlation will be the "bit". 

If only 2 events, i and j, are considered, then the coefficient is
equal to the mutual information, $I(x_i^1; x_j^1)$, between the 2 events as
defined in information theory²⁰.

$$C(x_i^1 x_j^1) = I(x_i^1; x_j^1) = \log \frac{p(x_i^1 x_j^1)}{p(x_i^1)\,p(x_j^1)} \qquad (2)$$

Let us relate the probabilities of formulae (1) and (2) to the
probabilities of document usage defined over the sample space of the
preceding section. The event $x_i^1$ is now the occurrence of document i in
a user's set of interest. The correlation $C(x_i^1 x_j^1)$ is the degree to
which the two documents, i and j, are taken to be mutually pertinent.

The approximation to C in terms of the estimated probabilities will
be denoted by the symbol $\hat{C}$.

$$\hat{C}(x_i^1 x_j^1) = \log \frac{N_{ij}/N}{(N_i/N)(N_j/N)} = \log \frac{N N_{ij}}{N_i N_j} = \hat{C}_{ij}$$
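
As a small illustration, the estimate log(N N_ij / (N_i N_j)) can be evaluated directly from the frequency counts and checked against several of the criteria of Sec. 3.2. The function name `relatedness` and all of the counts below are hypothetical, and the base-2 logarithm is assumed so that values are in bits.

```python
import math


def relatedness(n, n_i, n_j, n_ij):
    """Estimated measure log2(N * N_ij / (N_i * N_j)), in bits.

    Returns -inf when the two documents never co-occur (N_ij = 0).
    """
    if n_ij == 0:
        return -math.inf
    return math.log2(n * n_ij / (n_i * n_j))


N = 1024  # hypothetical number of partitionings

# Criterion 1: more co-occurrences, all else equal, gives a larger value.
assert relatedness(N, 32, 32, 8) > relatedness(N, 32, 32, 4)
# Criterion 2: more uses of one document, all else equal, gives a smaller value.
assert relatedness(N, 64, 32, 4) < relatedness(N, 32, 32, 4)
# Criterion 4: scaling every count by the same factor leaves the value unchanged.
assert math.isclose(relatedness(2 * N, 64, 64, 8), relatedness(N, 32, 32, 4))
# Criterion 5: counts consistent with statistical independence give 0 bits.
assert relatedness(N, 32, 32, 1) == 0.0

print(relatedness(N, 32, 32, 4))  # 2.0
```

These checks cover single numerical cases only; they illustrate the criteria rather than prove that the measure satisfies them in general.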



3.4 Practical Considerations

In order to calculate the measure of relatedness C for any
arbitrary set of documents selected from a collection of n documents, one
would have to estimate and perhaps store at least $2^n$ probabilities.
This is, of course, out of the question for any reasonably-sized
document file. If C is to be used, some approximating simplification must
be made.

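Let us now note that this correlation coefficient C can be expanded
in terms of mutual information terms as follows²⁰:

$$C(x_1^1 \ldots x_r^1) = \sum_{i,j=1}^{r} I(x_i^1; x_j^1) - \sum_{i,j,k=1}^{r} I(x_i^1; x_j^1; x_k^1) + \cdots$$

where each summation is taken once over every distinct pair, triple,
etc. of events, and

$$I(x_1^1; x_2^1) = \log \frac{p(x_1^1 x_2^1)}{p(x_1^1)\,p(x_2^1)}$$

$$I(x_1^1; x_2^1; x_3^1) = \log \frac{p(x_1^1 x_2^1)\,p(x_1^1 x_3^1)\,p(x_2^1 x_3^1)}{p(x_1^1)\,p(x_2^1)\,p(x_3^1)\,p(x_1^1 x_2^1 x_3^1)}$$

etc.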
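
As a check on the expansion, consider the case r = 3, for which the series terminates after the second summation. Taking the three-event term to be the logarithm of the product of the three pairwise probabilities divided by the product of the three single probabilities and the joint probability, each single probability appears in two of the pairwise terms, and the sum collapses as follows.

$$
\begin{aligned}
&I(x_1^1;x_2^1) + I(x_1^1;x_3^1) + I(x_2^1;x_3^1) - I(x_1^1;x_2^1;x_3^1) \\
&\quad= \log\frac{p(x_1^1x_2^1)\,p(x_1^1x_3^1)\,p(x_2^1x_3^1)}{[\,p(x_1^1)\,p(x_2^1)\,p(x_3^1)\,]^2}
 - \log\frac{p(x_1^1x_2^1)\,p(x_1^1x_3^1)\,p(x_2^1x_3^1)}{p(x_1^1)\,p(x_2^1)\,p(x_3^1)\,p(x_1^1x_2^1x_3^1)} \\
&\quad= \log\frac{p(x_1^1x_2^1x_3^1)}{p(x_1^1)\,p(x_2^1)\,p(x_3^1)} = C(x_1^1x_2^1x_3^1)
\end{aligned}
$$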
It has been proposed that C be approximated by the first summation
in this series, and that the other summations be dropped as
higher-order effects. There are some theoretical reasons which would
lead one to believe that this would result in a good approximation to
C²⁰. However, we shall rest our case here on practical necessity and
not go into the details of these theoretical arguments.

$$C(x_1^1 \ldots x_r^1) \approx \sum_{(i \neq j)} I(x_i^1; x_j^1) = \sum_{(i \neq j)} \log \frac{p(x_i^1 x_j^1)}{p(x_i^1)\,p(x_j^1)}$$

For this approximation one need only estimate and store n univariate
and $\binom{n}{2}$ bivariate probabilities in order to obtain the
correlation between events and subsets of events.

Through the same approach one can obtain an approximation to the
correlation between any two subsets of events:

$$C[(x_1^1 \ldots x_r^1)(y_1^1 \ldots y_s^1)] \approx \sum_{i=1}^{r} \sum_{j=1}^{s} I(x_i^1; y_j^1)$$

If these subsets overlap then one or more of the terms in the 
series becomes the self correlation of the event. 

$$C(x_i^1 x_i^1) = \log \frac{p(x_i^1 x_i^1)}{p(x_i^1)\,p(x_i^1)} = \log \frac{1}{p(x_i^1)}$$
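
The pairwise approximations of this section can be sketched as follows. The sketch assumes that only the univariate and bivariate probabilities are stored, that every stored pair probability is nonzero, and that each distinct pair is counted once; the function names and the probabilities are hypothetical.

```python
import math
from itertools import combinations


def mutual_information(p_i, p_j, p_ij):
    """Pairwise term log2(p_ij / (p_i * p_j)), in bits."""
    return math.log2(p_ij / (p_i * p_j))


def pair_prob(p_pair, i, j):
    """Look up a bivariate probability regardless of the order of i and j."""
    return p_pair[(i, j)] if (i, j) in p_pair else p_pair[(j, i)]


def set_correlation(docs, p_single, p_pair):
    """First-order approximation: sum of pairwise terms over distinct pairs."""
    return sum(
        mutual_information(p_single[i], p_single[j], pair_prob(p_pair, i, j))
        for i, j in combinations(docs, 2)
    )


def cross_correlation(docs_x, docs_y, p_single, p_pair):
    """Approximate correlation between two subsets of documents.

    A document appearing in both subsets contributes its self correlation,
    log2(1 / p_i), because p(x_i x_i) = p(x_i).
    """
    total = 0.0
    for i in docs_x:
        for j in docs_y:
            if i == j:
                total += math.log2(1.0 / p_single[i])
            else:
                total += mutual_information(
                    p_single[i], p_single[j], pair_prob(p_pair, i, j)
                )
    return total


# Hypothetical probabilities for three documents.
p_single = {"a": 0.25, "b": 0.25, "c": 0.5}
p_pair = {("a", "b"): 0.125, ("a", "c"): 0.125, ("b", "c"): 0.0625}

print(set_correlation(["a", "b", "c"], p_single, p_pair))      # 0.0
print(cross_correlation(["a"], ["b", "c"], p_single, p_pair))  # 1.0
print(cross_correlation(["a", "b"], ["b"], p_single, p_pair))  # 3.0
```

In the last call the two subsets overlap in document b, so one of the two terms is the self correlation of b (2 bits) rather than a pairwise term.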



3.5 Characteristics of the Measure for Document Pairs

The measure of relatedness is 0 for two statistically independent
events:

$$p(x_i^1 x_j^1) = p(x_i^1)\,p(x_j^1)$$

For events occurring together less often than if they were statistically 
independent, C is negative and for events occurring together more often 
C is positive. 

Theoretically the range of C is from $-\infty$ to $+\infty$. However, there is
a statement that can be made about the upper bound. Since
$p(x_i^1 x_j^1)$ cannot be larger than $p(x_i^1)$ or $p(x_j^1)$ the
following inequalities hold:

$$C(x_i^1 x_j^1) = \log \frac{p(x_i^1 x_j^1)}{p(x_i^1)\,p(x_j^1)} \le \log \frac{1}{p(x_i^1)}, \qquad C(x_i^1 x_j^1) \le \log \frac{1}{p(x_j^1)}$$

The quantity $\log[1/p(x_i^1)]$ is termed the self information of
$x_i^1$ in information theory. Thus, the correlation between two events is
always less than or equal to the self information of either event. Let
us indicate this range on the simple graph of Fig. 3.1.

[Figure: number line showing the range of C, from $-\infty$ up to
$\max[\log(1/p(x_i^1))]$.]

Fig. 3.1. Range of measure of relatedness.

Some additional comments about the range of the measure can be made
if we consider $\hat{C}$, the approximation to C based on the estimated
probabilities. The maximum positive value of $\hat{C}$ is $(\log N)$ and
occurs when $N_i$, $N_j$, and $N_{ij}$ all equal 1. Its minimum value other
than $-\infty$ is $(2 - \log N)$ and occurs when $N_{ij}$ is 1 and $N_i$ and
$N_j$ are N/2. This range is shown in Fig. 3.2.



[Figure: number line showing the range of $\hat{C}$: the isolated value
$-\infty$ and the interval from $2 - \log N$ to $\log N$.]

Fig. 3.2. Range of approximation to measure of relatedness.
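
The two limits of Fig. 3.2 can be worked out from the estimate, taken here as $\hat{C}_{ij} = \log(N N_{ij} / N_i N_j)$ with logarithms to the base 2. For the lower limit, note that with $N_{ij} = 1$ the counts must satisfy $N_i + N_j \le N + 1$, so the product $N_i N_j$ is largest near $N_i = N_j = N/2$.

$$\hat{C}_{\max} = \log \frac{N \cdot 1}{1 \cdot 1} = \log N, \qquad \hat{C}_{\min} = \log \frac{N \cdot 1}{(N/2)(N/2)} = \log \frac{4}{N} = 2 - \log N$$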

For the test data utilized in the experimental portion of this
project (see Sec. 6.1) it was found that the $\hat{C}$'s were either
$-\infty$ or had some positive value (see Fig. 3.3). The lower limit of
$(2 - \log N)$ in Fig. 3.2 is changed in Fig. 3.3 since all of the
$N_i$'s of the test data are much less than N/2. The new minimum of
$\hat{C}$ occurs when $N_{ij} = 1$ and $N_i$ and $N_j$ are maximum
(called $(N_i)_{\max}$).

[Figure: number line showing the range of $\hat{C}$ for the test data.]

Fig. 3.3. Range of measure of relatedness for test data.

The range for the test data is due not so much to the fact that the
occurrences of the documents in the test file are never statistically
independent as to the fact that such statistical independence can only
be detected with a very large data base. Consider documents i and j
with $p(x_i^1) = p(x_j^1) = 0.0001$. If $x_i^1$ and $x_j^1$ are statistically
independent, then $p(x_i^1 x_j^1) = 10^{-8}$. In order for any of the
probability estimates to be this small we would need at least $10^8$
partitionings. Many, many more partitionings than this would be needed if
one wanted to have accurate estimates of the occurrences of such rare
events. With fewer partitionings these events either never occur,
resulting in an estimate of 0 for $p(x_i^1 x_j^1)$, or do occur with the
estimate for $p(x_i^1 x_j^1)$ being larger than it should be. This is
the phenomenon observed for the test data. Even if there were
correlations that were 0 or slightly negative they would be pushed to
$-\infty$ or to some positive value because of the limited number of
partitionings available.
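
The arithmetic of this argument can be made concrete with a short sketch. It takes the probability 0.0001 from the example above and, as a simplifying assumption, supposes that the single-document counts take exactly their expected values; the function names are hypothetical.

```python
import math


def expected_counts(n, inv_p):
    """Expected N_i and N_ij for independent documents with p = 1 / inv_p."""
    return n / inv_p, n / inv_p**2


def estimate_with_one_cooccurrence(n, inv_p):
    """Estimated measure in bits when N_ij = 1 and N_i = N_j are as expected."""
    n_i, _ = expected_counts(n, inv_p)
    return math.log2(n * 1 / (n_i * n_i))


INV_P = 10**4  # p = 0.0001 for each document

for n in (10**6, 10**8):
    n_i, n_ij = expected_counts(n, INV_P)
    c_hat = estimate_with_one_cooccurrence(n, INV_P)
    print(n, n_i, n_ij, round(c_hat, 2))
```

With $10^6$ partitionings the expected number of co-occurrences is only 0.01, so the pair will usually never co-occur; if it does co-occur even once, the estimate is about 6.64 bits rather than 0. Only at about $10^8$ partitionings does a single co-occurrence correspond to an estimate of 0.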

It is conjectured that this will be the situation in most practical 
cases for some time to come. In a very large document collection 
(10 -10 items) the probability of occurrence of any one document is 
probably small, say 10" or 10 . This would require a file of 10 6 to 
10 partitionings to measure statistical independence which would take 
considerable time and effort to collect. In a small document collection 
the probability of occurrence of any one document could be larger but the
number of partitionings available would undoubtedly be less also.

It should be pointed out that this measure will assume some value 
for every pair of documents in the stack (except perhaps documents that 
have never been used). Even two documents that have never co-occurred
($N_{ij} = 0$) are related by the value $-\infty$.

A few comments should be made about the value $-\infty$. It is not a
realistic value for the correlation between most documents because it
implies that there is absolutely no chance of two documents co-occurring.
As has already been pointed out this arises because the probability
estimates may end up exactly zero. A much more practical and reasonable
approach to the problem would be to make all correlations between
document pairs for which $N_{ij} = 0$ equal to some finite negative value
instead of $-\infty$. More will be said on the choice of this negative
value (K) later (Sec. 4.5).
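
A minimal sketch of this substitution follows. The value of K used here is purely a placeholder, since the choice of K is deferred to a later section, and the function name is hypothetical.

```python
import math

K = -10.0  # placeholder floor in bits; not a value taken from the text


def relatedness_with_floor(n, n_i, n_j, n_ij, k=K):
    """log2(N * N_ij / (N_i * N_j)), with k substituted when N_ij = 0."""
    if n_ij == 0:
        return k
    return math.log2(n * n_ij / (n_i * n_j))


print(relatedness_with_floor(1024, 32, 32, 0))  # -10.0
print(relatedness_with_floor(1024, 32, 32, 4))  # 2.0
```

Pairs that have co-occurred at least once are unaffected; only the pairs that would otherwise receive the value $-\infty$ are changed.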

[Figure: number line showing the revised range of $\hat{C}$ for the test
data, with upper limit $\log N$.]

Fig. 3.4. Revised range of measure for test data.

Another feature of the selected measure is that it is non-directional. 
That is, the value of the measure from document i to j is the same as 
from j to i. 

3.6 Document Networks 

It has been suggested that measures of the relatedness between
documents should be metrics. This would require that a measure C exhibit
the following properties:

(1) $C(x,x) = 0$

(2) $C(x,y) > 0$ (if $x \neq y$)

(3) $C(x,y) = C(y,x)$

(4) $C(x,y) + C(y,z) \geq C(x,z)$

The measure under consideration does meet property (3). It might
conceivably be made to fit properties (1) and (2) through some type of
normalization or restriction. There appears to be no way to make it
have property (4), the triangle inequality. Indeed, it would be rather
disturbing to this author if it did have property (4).

Bar-Hillel has pointed out in the comment cited in Sec. 2.21 that
many of the important aspects of a document collection (except physical
location) cannot be made to satisfy the triangle inequality and cannot,
therefore, be represented by metrics. His conclusion was that measures
derived from these features (joint usage, common citation, etc.) are
useless. Our conclusion is that such measures should not be required to be
metrics.
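
A small numerical case shows how the measure departs from the metric properties. The counts below are hypothetical: documents a and c always occur together, and the widely used document b occurs with them only once. The estimate log2(N N_ij / (N_i N_j)) is assumed.

```python
import math


def c_hat(n, n_i, n_j, n_ij):
    """Estimated relatedness log2(N * N_ij / (N_i * N_j)), in bits."""
    return math.log2(n * n_ij / (n_i * n_j))


N = 100                        # hypothetical number of partitionings
N_a, N_b, N_c = 10, 50, 10     # single-document counts
N_ab, N_bc, N_ac = 1, 1, 10    # co-occurrence counts

c_ab = c_hat(N, N_a, N_b, N_ab)  # a and b rarely together: negative
c_bc = c_hat(N, N_b, N_c, N_bc)  # b and c rarely together: negative
c_ac = c_hat(N, N_a, N_c, N_ac)  # a and c always together: positive
c_aa = c_hat(N, N_a, N_a, N_a)   # self correlation log2(N / N_a)

violates_triangle = c_ab + c_bc < c_ac
print(round(c_ab, 2), round(c_ac, 2), round(c_aa, 2), violates_triangle)
```

Here the link values for a-b and b-c are negative (violating property 2), the self correlation of a is not 0 (violating property 1), and the sum of the a-b and b-c values is less than the a-c value (violating property 4). Only the symmetry property holds without modification.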

The idea that a metric space is the appropriate model for a docu- 
ment collection is rejected here. If one desires a model to aid in his 
mental picture of a document collection, a simple network is suggested. 
Each document can be considered a node and the link between two nodes 
can be assigned the value of the measure of relatedness between the 
corresponding documents. It has already been pointed out that the 
measure of relatedness chosen links every node (document) to every other 
node. It might, therefore, be easier to visualize the sub-network con- 
sisting of only positive links. This is the visual picture found most 
helpful to the author. 

Thus far we have considered the problem of generating a document 
network from a set of probabilities. Let us now consider the reverse 
process. If one draws a document network and arbitrarily chooses the 
values to be assigned to the links, can a set of probabilities be found
which could have generated the network? This question is of interest
because if there is only a certain class of networks that are realizable
from sets of probabilities, then we need focus our attention only on that
class.

Theorem. For every document network (with the restriction
that the values of the positive links be finite) there is at least
one set of probabilities which could have generated it.

Proof. The first step in proving this theorem will be to select a
set of values for the elementary probabilities, $p(x_1 \ldots x_n)$. It will then
be shown that the set selected yields the correct values for the links
of the network in question and forms a valid set of probabilities (i.e.
each value is in the range 0 to 1 and their sum is 1).

Before proceeding let us define the following symbols.

$n$: number of documents in the network ($n > 2$).

$C(x_i^1 x_j^1)$: value of the network link between documents $x_i$ and $x_j$.

$C_{max}$: maximum value of $C(x_i^1 x_j^1)$.

$k$: the lesser of the two quantities $(1/n)$ and $(1/n)\,2^{-C_{max}}$.

It will also be convenient to introduce at this point one additional
notation convention. Let us allow the values of the variables in the
$p(x_1 \ldots x_n)$'s which differ from 0 to be specified by a statement following
a colon as well as by superscripting. For example:

$$p(x_1 \ldots x_n : x_i = 1) = p(x_1^0 \ldots x_{i-1}^0\, x_i^1\, x_{i+1}^0 \ldots x_n^0)$$

We are now ready to state the values for the elementary probabilities,
$p(x_1 \ldots x_n)$. Four possible classes will be considered.

(1) All $p(x_1 \ldots x_n)$ for which three or more x's are 1:

$$p(x_1 \ldots x_n : \text{at least 3 x's} = 1) = 0$$

(2) All $p(x_1 \ldots x_n)$ for which two x's are 1:

$$p(x_1 \ldots x_n : x_i, x_j = 1) = k^2\, 2^{C(x_i^1 x_j^1)} \quad \text{for all } i,j\ (i \neq j).$$

(3) All $p(x_1 \ldots x_n)$ for which one x is 1:

$$p(x_1 \ldots x_n : x_i = 1) = k - k^2 \sum_{j=1,\, j \neq i}^{n} 2^{C(x_i^1 x_j^1)} \quad \text{for all } i.$$

(4) The $p(x_1 \ldots x_n)$ for which no x is 1:

$$p(x_1^0 \ldots x_n^0) = 1 - nk + (k^2/2) \sum_{i,j=1,\, i \neq j}^{n} 2^{C(x_i^1 x_j^1)}$$

The motivation behind the selection of these values will become 
clearer as the discussion proceeds. It may be helpful, however, to note 
three of the underlying ideas at this point. 

(1) Each $p(x_i^1)$ is to have the same value.

$$p(x_i^1) = k$$

(2) The value of the $p(x_i^1)$'s is to be chosen so that the $p(x_i^1 x_j^1)$'s
can be adjusted to give the desired $C(x_i^1 x_j^1)$'s.

$$p(x_i^1 x_j^1) = k^2\, 2^{C(x_i^1 x_j^1)}$$

(3) The only elementary events that are allowed to occur are those
with zero, one or two documents in the subset of interest.

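Let us prove that the elementary probabilities as selected above
generate the correct values for the links of the document network.
Preliminary to doing this we will determine the values of the $p(x_i^1)$'s and
$p(x_i^1 x_j^1)$'s.

$$p(x_i^1) = \sum_{\text{all p's for which } x_i = 1} p(x_1 \ldots x_n)$$

$$= p(x_1 \ldots x_n : x_i = 1) + \sum_{j=1,\, j \neq i}^{n} p(x_1 \ldots x_n : x_i, x_j = 1)$$

$$= k - k^2 \sum_{j=1,\, j \neq i}^{n} 2^{C(x_i^1 x_j^1)} + k^2 \sum_{j=1,\, j \neq i}^{n} 2^{C(x_i^1 x_j^1)}$$

$$p(x_i^1) = k \quad \text{for all } i.$$

$$p(x_i^1 x_j^1) = \sum_{\text{all p's for which } x_i, x_j = 1} p(x_1 \ldots x_n) = p(x_1 \ldots x_n : x_i, x_j = 1)$$

$$p(x_i^1 x_j^1) = k^2\, 2^{C(x_i^1 x_j^1)} \quad \text{for all } i,j\ (i \neq j).$$

Substituting these values into the measure (logarithms are to base 2,
consistent with the factors $2^{C}$ used above) gives

$$\log \frac{p(x_i^1 x_j^1)}{p(x_i^1)\, p(x_j^1)} = \log \frac{k^2\, 2^{C(x_i^1 x_j^1)}}{(k)(k)} = C(x_i^1 x_j^1) \quad \text{for all } i,j\ (i \neq j).$$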

In order for the set of values selected for the $p(x_1 \ldots x_n)$'s to
form a valid set of probabilities, their sum must be 1.

$$S = \sum_{\text{over all x's}} p(x_1 \ldots x_n)$$

$$= \frac{1}{2} \sum_{i,j=1,\, i \neq j}^{n} p(x_1 \ldots x_n : x_i, x_j = 1) + \sum_{i=1}^{n} p(x_1 \ldots x_n : x_i = 1) + p(x_1^0 \ldots x_n^0)$$

$$= (k^2/2) \sum_{i,j=1,\, i \neq j}^{n} 2^{C(x_i^1 x_j^1)} + nk - k^2 \sum_{i,j=1,\, i \neq j}^{n} 2^{C(x_i^1 x_j^1)} + 1 - nk + (k^2/2) \sum_{i,j=1,\, i \neq j}^{n} 2^{C(x_i^1 x_j^1)}$$

$$S = 1$$

We must also prove that the values selected for the $p(x_1 \ldots x_n)$'s
are in the range 0 to 1. The values for the first class of probabilities,
$p(x_1 \ldots x_n : \text{at least 3 x's} = 1)$, are all 0 and thus automatically in
the range. The values assigned to the probabilities of the second class,
$p(x_1 \ldots x_n : x_i, x_j = 1)$, can be shown to be in the range by the following
argument.

$$k \leq (1/n)\, 2^{-C_{max}} \leq (1/n)\, 2^{-C(x_i^1 x_j^1)}$$

$$k\, 2^{C(x_i^1 x_j^1)} \leq (1/n) \quad \text{and} \quad k \leq (1/n)$$

$$\therefore\ k^2\, 2^{C(x_i^1 x_j^1)} \leq (1/n)^2$$

$$0 \leq k^2\, 2^{C(x_i^1 x_j^1)} < 1$$

Next let us show that the values assigned to the probabilities of
the third class, $p(x_1 \ldots x_n : x_i = 1)$, are in the correct range.

$$k - k^2 \sum_{j=1,\, j \neq i}^{n} 2^{C(x_i^1 x_j^1)} \leq k \leq 1/n < 1$$

$$k - k^2 \sum_{j=1,\, j \neq i}^{n} 2^{C(x_i^1 x_j^1)} \geq k - k(n-1)(1/n) > 0$$


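Finally let us check the range of $p(x_1^0 \ldots x_n^0)$.

$$1 - nk + (k^2/2) \sum_{i,j=1,\, i \neq j}^{n} 2^{C(x_i^1 x_j^1)} \leq 1 - nk + (k/2)(n)(n-1)(1/n) = 1 - \frac{nk}{2} - \frac{k}{2} < 1$$

$$1 - nk + (k^2/2) \sum_{i,j=1,\, i \neq j}^{n} 2^{C(x_i^1 x_j^1)} \geq 1 - nk \geq 1 - n(1/n) = 0$$

QED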
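
The construction used in the proof can be tried numerically. The following is a minimal sketch, not part of the original work: it assumes base-2 logarithms, finite link values, and a hypothetical three-document network whose link values were chosen only for illustration. The function names are likewise assumptions.

```python
import itertools
import math


def build_probabilities(n, links):
    """Return {event: probability}; an event is a 0/1 tuple of length n.

    `links` maps index pairs (i, j) with i < j to finite link values.
    Events with three or more 1's (class 1) are omitted: probability 0.
    """
    def c(i, j):
        return links[(min(i, j), max(i, j))]

    c_max = max(links.values())
    k = min(1.0 / n, (1.0 / n) * 2.0 ** (-c_max))
    probs = {}
    # Class 2: exactly two documents in the subset of interest.
    for i, j in itertools.combinations(range(n), 2):
        event = tuple(int(t in (i, j)) for t in range(n))
        probs[event] = k * k * 2.0 ** c(i, j)
    # Class 3: exactly one document in the subset of interest.
    for i in range(n):
        event = tuple(int(t == i) for t in range(n))
        probs[event] = k - k * k * sum(2.0 ** c(i, j) for j in range(n) if j != i)
    # Class 4: no document in the subset of interest.
    pair_sum = sum(2.0 ** c(i, j) for i in range(n) for j in range(n) if i != j)
    probs[(0,) * n] = 1.0 - n * k + (k * k / 2.0) * pair_sum
    return probs


def link_from_probabilities(probs, i, j):
    """Recompute C(x_i, x_j) = log2[p(x_i x_j) / (p(x_i) p(x_j))]."""
    p_i = sum(p for e, p in probs.items() if e[i])
    p_j = sum(p for e, p in probs.items() if e[j])
    p_ij = sum(p for e, p in probs.items() if e[i] and e[j])
    return math.log2(p_ij / (p_i * p_j))


# Hypothetical three-document network (values chosen only for illustration).
example_links = {(0, 1): 2.0, (0, 2): -3.0, (1, 2): 0.5}
example_probs = build_probabilities(3, example_links)
for pair, value in example_links.items():
    print(pair, value, round(link_from_probabilities(example_probs, *pair), 9))
```

Each printed line should show the assigned link value next to the value recomputed from the constructed probabilities; the two agree up to rounding.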
CHAPTER IV
DOCUMENT CLUSTERS 

In the last chapter a measure of relatedness between documents was 
defined and a document network based on the measure was described. The 
next step to be taken is to formulate a definition for what constitutes 
a subset (cluster) of highly inter-related documents based on this 
measure. The purpose of such a definition is to provide the user who 
has requested information from the system with a set (cluster) of papers 
which is judged to be related to his interest. 

The exact form that a request for information can take and the pro- 
cedure used to translate a request into an answer cluster will be de- 
scribed in Chapter V. The way a cluster is obtained, modified, and 
stored in the experimental system devised for this project will be 
covered in Chapter VI. In this chapter we shall confine our attention 
to what constitutes an appropriate cluster of documents. Two types of 
clusters will be defined and analyzed, and certain modifications will be 
described which make one of the definitions acceptable. 

4.1 Local Maximum Clusters

The cluster definition which was first proposed and tested turned 
out to be the one which was eventually selected for this project. Let 
us formally define it and then discuss its characteristics.

In this definition and in the remainder of this thesis we will find
use for the following set operators.

$\cup$: Set union. $(A \cup B)$ is the set of all documents in set A or in set B.

$\cap$: Set intersection. $(A \cap B)$ is the set of documents in both set A and set B.

$\subset$: Set inclusion. $(A \subset B)$ means that the set A is included in the set B.

$\bar{X}$: Set complementation. $\bar{X}$ is the set of all documents not in X.

Definition: Local Maximum Cluster

A local maximum cluster is defined to be any subset of documents
$X_\alpha = (x_{\alpha_1}, \ldots, x_{\alpha_r})$ for which both of the following conditions
hold.

1. Every document $x_i$ in $X_\alpha$ is positively correlated to the
remainder of $X_\alpha$.

$$C[x_i (X_\alpha \cap \bar{x}_i)] > 0 \quad \text{for all } x_i \subset X_\alpha.$$

2. Every document $x_j$ not in $X_\alpha$ is negatively correlated to $X_\alpha$.

$$C(x_j X_\alpha) \leq 0 \quad \text{for all } x_j \subset \bar{X}_\alpha.$$

(Note that zero is arbitrarily classed as a negative value.)

A local maximum cluster is so named because every possible single
change (addition or deletion) to the cluster will result in a decrease
in its internal correlation. The internal correlation $C(X)$ of a subset
X is defined to be the sum of the links whose ends both terminate in the
subset. If $X_\alpha$ is a cluster, then

$$C(X_\alpha) > C(X_\beta) \quad \text{for all } X_\beta \text{ which differ from } X_\alpha \text{ by a single document.}$$

Five specific characteristics of local maximum clusters have been
selected for discussion below.

Size. The average size of the clusters produced by the local
maximum definition is very much a function of the correlation assigned
to document pairs that have not co-occurred together ($N_{ij} = 0$). It has
already been noted that although this correlation, K, is $-\infty$ by the
formula, some finite value is more appropriate (Sec. 3.4). If K is
made positive, then there will be only one cluster consisting of the
total file. If K is made just slightly negative, then the clusters
formed will be disjoint and consist of all documents connected by one or
more paths of positive links. If K is made very negative, the only
clusters will be those sets of documents wherein every document has
co-occurred with every other document.

Overlap. It is fairly obvious that local maximum clusters can overlap.
Consider the network of Fig. 4.1 in which all the links shown have
the value +5 and all the links not shown have the value -6. The two
local maximum clusters, one a pair and one a triple of documents,
overlap through the document they share.

[Figure: network diagram. Links shown are +5; links not shown are -6.]

Fig. 4.1. Network with overlapping clusters.

Coverage. The following simple theorem shows that local maximum
clusters may not cover all the documents in the network.

Theorem. Document networks exist which have documents that are
not included in any local maximum cluster.

Proof. First consider a document that has never co-occurred with
any other document. Such a document does not prove the theorem because
it is included in a cluster which consists of only the document itself.

Now consider the network of Fig. 4.2. The only cluster is
$(x_2 x_3 x_4 x_5)$. The document $x_1$ cannot form a cluster by itself since $x_2$
and $x_3$ are positively correlated to it. It cannot form a cluster with
$x_2$ and $x_3$ since $x_4$ and $x_5$ are positively correlated to the set $(x_1 x_2 x_3)$
with the value 5 + 5 - 6 = 4. Thus $x_1$ occurs in no cluster. QED

[Figure: network diagram. Links shown are +5; links not shown are -6.]

Fig. 4.2. Network with a document ($x_1$) in no cluster.

Although local maximum clusters do not cover all possible documents 
in a network, one is at least assured of the following-- 

Theorem. Every document network contains at least one
local maximum cluster.

Proof. The proof will be constructive. A local maximum cluster
can be formed by successively making single changes (additions or
deletions) to a subset of documents as outlined in the following 3-step
procedure.

1. Pick a document at random as the initial member of the subset.

2. If every document outside the subset is negatively correlated
to the subset and every document inside the subset is positively
correlated to the remainder of the subset, then quit. The local maximum
cluster has been found.

3. Otherwise either add a positively correlated document that is
not in the subset or delete a negatively correlated document that is in
the subset. It doesn't matter which is done, but only one change is to
be made at a time. Now return to step 2.

This procedure is assured of termination if the document set is
finite because step 3 always increases the internal correlation (sum of
the internal links) of the subset being formed. There is, of course, an
upper limit to the internal correlation of any finite set of documents.

QED
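
The 3-step procedure can be sketched in a few lines. This is an illustrative sketch only, not the program used in the experimental system. It assumes links are stored in a dictionary keyed by document pairs, that any pair not listed takes a default value standing in for K, that a single document counts as a valid cluster, and that the starting document is passed in rather than drawn at random so the run is repeatable. The example network (a ring of four documents with +5 on the ring and -6 elsewhere) is hypothetical.

```python
def correlation(links, default, doc, subset):
    """Sum of the links between `doc` and every other member of `subset`."""
    return sum(links.get(frozenset((doc, other)), default)
               for other in subset if other != doc)


def find_local_max_cluster(documents, links, default, start):
    subset = {start}                      # step 1 (start is passed in, not random)
    while True:
        for doc in documents:
            c = correlation(links, default, doc, subset)
            if doc not in subset and c > 0:
                subset.add(doc)           # step 3: add a positively correlated document
                break
            if doc in subset and len(subset) > 1 and c <= 0:
                subset.remove(doc)        # step 3: delete a negatively correlated one
                break
        else:
            return subset                 # step 2: no single change applies


# Hypothetical network: ring 1-2-3-4-1 with +5 links, all other links -6.
ring_links = {frozenset(p): 5 for p in [(1, 2), (2, 3), (3, 4), (1, 4)]}
cluster = find_local_max_cluster([1, 2, 3, 4], ring_links, -6, start=1)
print(sorted(cluster))  # [1, 2]
```

Because zero is classed as negative, a deletion at exactly zero leaves the internal correlation unchanged but shrinks the subset, so the loop still cannot revisit an earlier state.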

Structure. Local maximum clusters can form the type of hierarchal
structure indicated by the following theorem.

Theorem. A local maximum cluster can be a subset of
another local maximum cluster.

Proof. Again we can use an example to prove the theorem. In the
document network of Fig. 4.3 there are five local maxima:

$(x_1 x_2)$, $(x_2 x_3)$, $(x_3 x_4)$, $(x_1 x_4)$, $(x_1 x_2 x_3 x_4)$.

The first four of these are subsets of the fifth. QED
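
For a small network the two conditions of the local maximum definition can simply be tested on every subset. The sketch below does this for an assumed network: four documents joined in a ring by +5 links, with -6 on the two remaining links. That network is one reading of the figure, not a transcription of it, and the function names are illustrative. A single document is treated as satisfying Condition 1 trivially.

```python
from itertools import combinations


def correlation(links, default, doc, subset):
    """Sum of the links between `doc` and every other member of `subset`."""
    return sum(links.get(frozenset((doc, other)), default)
               for other in subset if other != doc)


def is_local_max_cluster(documents, links, default, subset):
    # Condition 1: every member is positively correlated to the remainder.
    inside_ok = len(subset) == 1 or all(
        correlation(links, default, d, subset) > 0 for d in subset)
    # Condition 2: every outsider is negatively (or zero) correlated to the subset.
    outside_ok = all(
        correlation(links, default, d, subset) <= 0
        for d in documents if d not in subset)
    return inside_ok and outside_ok


def all_local_max_clusters(documents, links, default):
    found = []
    for size in range(1, len(documents) + 1):
        for combo in combinations(documents, size):
            if is_local_max_cluster(documents, links, default, set(combo)):
                found.append(combo)
    return found


# Assumed network: ring 1-2-3-4-1 with +5 links, all other links -6.
ring_links = {frozenset(p): 5 for p in [(1, 2), (2, 3), (3, 4), (1, 4)]}
clusters = all_local_max_clusters([1, 2, 3, 4], ring_links, -6)
print(clusters)  # [(1, 2), (1, 4), (2, 3), (3, 4), (1, 2, 3, 4)]
```

On this network the search finds four pairs and the full set, so the four smaller clusters all lie inside the fifth. Exhaustive enumeration is only practical for very small networks.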



[Figure: network diagram. Links shown are +5; links not shown are -6.]

Fig. 4.3. Network with hierarchal cluster structure.

Relatedness. Now consider the problem of whether local maximum
clusters form well related sets.

Theorem. Totally unrelated subsets of documents can occur
together in a local maximum cluster. By totally unrelated we
mean that no document in one set is positively correlated to a
document in the other set.

Proof. This theorem can be proved by another simple example. The
set $(x_1 x_2 x_3 x_4)$ of Fig. 4.4 forms a cluster and yet there are no positive
links between the set $(x_1 x_2)$ and the set $(x_3 x_4)$. QED

[Figure: network diagram of four documents. Links shown are +7; links not shown are -3.]

Fig. 4.4. Cluster containing unrelated subsets.

The inclusion of unrelated subsets in the same cluster is considered 
an undesirable characteristic for a cluster to have. The reason why this 
is so involves the design of the procedure of Chapter V. It was decided 
that the procedure could be greatly simplified if one were to assume 
that each request for information from the system has only one purpose. 
A person who has several areas of interest on which he desires informa- 
tion is expected to make a separate request for each area. It follows 
that if each request has a single purpose, then the document clusters 
which are to answer these requests should not be divisible into unrelated 
subsets. 

4.2 Subset Clusters

In an attempt to keep completely unrelated sets of documents from 
becoming part of the same cluster, a definition was devised based on the 
addition of subsets or the deletion of subsets of documents as opposed 
to the single changes allowed in the local maximum definition. This 
definition was accepted as the one most suitable for this project for a 
number of months. In this section we shall describe it, note its charac- 
teristics, and explain why it was finally discarded. 

Definition 1: Subset Cluster

A subset cluster is defined to be any set of documents
$X_\alpha = (x_{\alpha_1}, \ldots, x_{\alpha_r})$ for which both of the following conditions
hold.

1. Every subset of documents $X_\beta$ included within $X_\alpha$ is
positively correlated to the remainder of $X_\alpha$.

$$C[X_\beta (X_\alpha \cap \bar{X}_\beta)] > 0 \quad \text{for all } X_\beta \subset X_\alpha.$$

2. Every subset of documents $X_\beta$ external to $X_\alpha$ is
negatively correlated to $X_\alpha$.

$$C(X_\beta X_\alpha) \leq 0 \quad \text{for all } X_\beta \subset \bar{X}_\alpha.$$

It is worth noting that Condition 2 of the local maximum cluster
definition is equivalent to Condition 2 above. If each document external
to $X_\alpha$ is negatively correlated to $X_\alpha$, then certainly all external subsets
are negatively correlated to $X_\alpha$. Conversely if each subset is negatively
correlated to $X_\alpha$, then, of course, single documents, being subsets, are
also negatively correlated to $X_\alpha$. It should also be pointed out that all
subset clusters are local maximum clusters but not vice versa.

Next let us present an alternative definition of a subset cluster.

Definition 2: Subset Cluster

A subset cluster is defined to be any set of documents
$X_\alpha = (x_{\alpha_1}, \ldots, x_{\alpha_r})$ for which both of the following conditions
hold.

1. The internal correlation of $X_\alpha$ as defined in Sec. 4.1
is greater than the sum of the internal correlations of the disjoint
subsets of $X_\alpha$ created by any arbitrary partitioning.

$$C(X_\alpha) > \sum_{i=1}^{r} C(D_i)$$

for all partitionings in which $(D_1 \cup \ldots \cup D_r) = X_\alpha$ and
$D_i \cap D_j$ = null set.

2. The sum of the internal correlations of $X_\alpha$ and some subset
$X_\beta$ external to $X_\alpha$ is greater than or equal to the internal
correlation of the set formed by adding $X_\beta$ to $X_\alpha$.

$$C(X_\alpha) + C(X_\beta) \geq C(X_\alpha \cup X_\beta) \quad \text{for all } X_\beta \subset \bar{X}_\alpha.$$

Theorem. Definition 1 and Definition 2 for subset clusters
are equivalent.

Proof. The equivalence of the second conditions of both definitions
is fairly obvious. The equivalence of the first conditions requires some
verification.

Let us assume that Cond. 1 of Def. 2 holds and partition the
cluster into two subsets.

$$C(X_\alpha) > C(X_\beta) + C(X_\alpha \cap \bar{X}_\beta)$$

But:

$$C(X_\alpha) = C(X_\beta) + C(X_\alpha \cap \bar{X}_\beta) + C[(X_\beta)(X_\alpha \cap \bar{X}_\beta)]$$

$$\therefore\ C[(X_\beta)(X_\alpha \cap \bar{X}_\beta)] > 0$$

This last result is Cond. 1 of Def. 1.

Now let us assume that Cond. 1 of Def. 1 holds and partition the
cluster into the disjoint subsets $D_1, \ldots, D_r$. By Def. 1:

$$C[(D_i)(X_\alpha \cap \bar{D}_i)] > 0 \quad \text{for all } D_1, \ldots, D_r$$

But:

$$C(X_\alpha) = \sum_{i=1}^{r} C(D_i) + \frac{1}{2} \sum_{i=1}^{r} C[(D_i)(X_\alpha \cap \bar{D}_i)] > \sum_{i=1}^{r} C(D_i)$$

Thus if Cond. 1 of Def. 1 is true, Cond. 1 of Def. 2 is also. QED

Let us discuss now some of the characteristics of subset clusters.
The comments and theorems on cluster size, overlap and coverage, which
were made in Sec. 4.1 for local maximum clusters, hold for subset
clusters also with the exception that one is no longer assured of having
at least one cluster in any given document network.

Theorem. There exist document networks which contain
no subset clusters.

Proof. Examination of each of the $2^n$ possible subsets in the network
of Fig. 4.5 reveals that none of them satisfy the two conditions
necessary for subset clusters. QED

[Figure: network diagram with individually labeled link values. Links not shown are -5.]

Fig. 4.5. Network containing no subset clusters.

Structure. Next we note that a hierarchal structure is no longer
possible with subset clusters.

Theorem. No subset cluster $X_\beta$ can be included within another
subset cluster $X_\alpha$.

Proof. Let us assume that $X_\alpha$ and $X_\beta$ are subset clusters and that
$X_\beta \subset X_\alpha$. Since $X_\alpha$ is a cluster and $X_\beta \subset X_\alpha$, then by Cond. 1
of the definition:

$$C[X_\beta (X_\alpha \cap \bar{X}_\beta)] > 0$$

But since $X_\beta$ is a cluster and $(X_\alpha \cap \bar{X}_\beta) \subset \bar{X}_\beta$, then by Cond. 2:

$$C[X_\beta (X_\alpha \cap \bar{X}_\beta)] \leq 0$$

which contradicts the previous inequality. QED

Relatedness. In the last section it was pointed out that one of the
difficulties with local maximum clusters lies in the fact that even
completely uncorrelated sets of documents can occur in the same cluster.
It was for this reason that the subset definition was devised. In subset
clusters one is assured by definition that no subset of the cluster
is negatively correlated to the remainder of the cluster.

Utility. The problems of coverage and hierarchy did not prove to be
serious drawbacks to the subset definition of clusters. An extension to
the definition was devised which allowed all documents to be in at least
one cluster and provided for hierarchal relationships. This extension
involved applying a bias to the links of the network. (See Sec. 4.4.)
The subset definition was finally abandoned because no
method could be found that would isolate subset clusters with a
reasonable amount of effort.

Consider for a moment the problem of checking Condition 1 of the
subset definition. One must determine whether there is a partitioning
of a set of documents which results in two subsets that are negatively
correlated to each other. The brute force method is to try every
partitioning. This would involve on the order of $2^n$ tests for a set of n
documents and would certainly be too much processing for an n of 20 or 30
even on a high speed digital computer. Several efforts were made to devise
a more efficient method. Although they were not entirely successful, it might
be well to briefly document a couple of them.
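
The brute force check is easy to state in code, which also makes its cost visible. The following sketch is illustrative only. It assumes the same dictionary representation of links as a convenience, treats a cross correlation of zero as negative (following the convention of Sec. 4.1), and uses a hypothetical four-document network with two +7 links and -3 elsewhere.

```python
from itertools import combinations


def cross_correlation(links, default, part_a, part_b):
    """Sum of the links that join a document in part_a to one in part_b."""
    return sum(links.get(frozenset((x, y)), default)
               for x in part_a for y in part_b)


def find_splits(documents, links, default):
    docs = list(documents)
    first, rest = docs[0], docs[1:]
    splits = []
    # Fixing `first` in part A visits each two-way partitioning exactly once:
    # 2**(n-1) - 1 partitionings for n documents.
    for size in range(len(rest)):
        for extra in combinations(rest, size):
            part_a = {first, *extra}
            part_b = set(docs) - part_a
            if cross_correlation(links, default, part_a, part_b) <= 0:
                splits.append((part_a, part_b))
    return splits


# Hypothetical network: links 1-2 and 3-4 are +7, all other links are -3.
pair_links = {frozenset((1, 2)): 7, frozenset((3, 4)): 7}
splits = find_splits([1, 2, 3, 4], pair_links, -3)
print(splits)  # [({1, 2}, {3, 4})]
```

Here the full set of four documents satisfies the local maximum conditions, yet the search finds one partitioning with negative cross correlation, so the set fails Condition 1 of the subset definition. The number of partitionings doubles with each added document, which is the difficulty described above.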

4.3 Finding Subset Clusters

In the first method for finding subset clusters which was investi- 
gated, an effort was made to determine if a partitioning of a set existed 
which would result in two negatively correlated subsets. Such a parti- 
tioning is called a 'split' of the set in the following discussion. 

In the other approach emphasis was focused on the small, very
highly correlated subsets called 'kernels' within the document set and
an attempt was made to combine and expand these until a split appeared.

4.31 Locating Splits

We wish to devise a method which will determine whether a set of 
documents can be split into two negatively correlated subsets and to 
locate where such splits are. Some of the theorems that were developed 
for this purpose will be stated below. In the interests of brevity the 
proofs will not be given. The symbols used in these theorems are 
defined as follows. 

n - number of documents in S, the set under consideration.

a - number of documents in a subset A of S.

b - number of documents in a subset B where $B = S \cap \bar{A}$. ($a + b = n$, $A \cup B = S$)

K - negative value assigned to links for which $N_{ij} = 0$.

$C_{min}$ - smallest value of the links for which $N_{ij} \neq 0$. It will be
assumed in the following theorems that $C_{min}$ is positive.
(See Sec. 3.5.)

$C_{max}$ - largest positive link in the network.

d - number of links in the set S which have the value K.

Theorem 1: Consider the partitioning of a set of
documents into the subsets A and B.

Part A: Only those partitionings which satisfy the following
inequality can possibly result in splits.

$$(a)(b) < \left( \frac{C_{min} + |K|}{C_{min}} \right) d$$

Part B: A necessary condition for a partitioning to result in a
split is that the partitioning must be crossed by at least
r negative links where:

$$r = \frac{(a)(b)(C_{min})}{C_{min} + |K|}$$

Part C: A sufficient condition for a partitioning to result in
a split is that the partitioning be crossed by at least s
negative links where:

$$s = \frac{(a)(b)(C_{max})}{C_{max} + |K|}$$
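
Since the proofs are omitted, a short sketch of where the three bounds can come from may help. It is a reconstruction under stated assumptions, not the author's argument: every link crossing the partitioning is assumed either to equal K or to lie between $C_{min}$ and $C_{max}$, a split is taken to mean a cross correlation $X(A,B) \leq 0$, and r denotes the number of crossing links equal to K. The symbol $X(A,B)$ is introduced here only for this sketch.

```latex
% X(A,B): cross correlation of the partitioning (sum of the ab crossing links)
% r: number of crossing links with the value K
\begin{align*}
X(A,B) &\ge (ab - r)\,C_{min} - r\,|K| \\
X(A,B) \le 0 &\;\Rightarrow\; r \ge \frac{ab\,C_{min}}{C_{min} + |K|}
  && \text{(Part B)} \\
r \le d &\;\Rightarrow\; ab \le \left(\frac{C_{min} + |K|}{C_{min}}\right) d
  && \text{(Part A, up to the boundary case)} \\
X(A,B) &\le (ab - r)\,C_{max} - r\,|K| \le 0
  \quad \text{when } r \ge \frac{ab\,C_{max}}{C_{max} + |K|}
  && \text{(Part C)}
\end{align*}
```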



Example of Theorem 1:

n = 20
K = -5
$C_{min}$ = 4
d = 40 (40 of the 190 links are negative)

By Part A of the theorem (a)(b) must be less than 90 to allow a
split. Therefore partitionings with distributions a:b = 10:10, 9:11,
8:12, and 7:13 cannot possibly result in splits. This immediately
eliminates about 90% of the possible partitionings as candidates for
splitting the set. Unfortunately there are some 60,460 partitionings
that still must be considered, which is still out of the question.

However if the 40 negative links are all bunched on only 5 of the
nodes (8 per node), then by Part B of the theorem only 61 partitionings
can possibly cause splits and these can easily be checked.

If only 10% of the links are negative (19 instead of 40), then only
partitionings with a:b = 1:19 and 2:18 can cause splits. There are 210
such partitionings and a check of these would also be possible.

However in the general case $C_{min}$ may be small, d may be large, and
the negative links may not be so fortuitously arranged, so that the number
of partitionings which must be examined may still remain very large.
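
The arithmetic of this example can be rechecked with a few lines of code. This is only a sketch of the Part A bound; the function names are illustrative, and partitionings are counted as unordered pairs of non-empty subsets.

```python
from math import comb


def max_ab_product(c_min, k, d):
    """Part A bound: (a)(b) must be below this value for a split to be possible."""
    return (c_min + abs(k)) / c_min * d


def surviving_partitionings(n, bound):
    """Count two-way partitionings whose size product a*(n-a) is below the bound."""
    total = 0
    for a in range(1, n // 2 + 1):
        if a * (n - a) < bound:
            count = comb(n, a)
            if 2 * a == n:
                count //= 2  # an even split would otherwise be counted twice
            total += count
    return total


bound_40 = max_ab_product(4, -5, 40)   # 90.0
bound_19 = max_ab_product(4, -5, 19)   # 42.75
print(bound_40, surviving_partitionings(20, bound_40))  # 90.0 60459
print(bound_19, surviving_partitionings(20, bound_19))  # 42.75 210
```

The second count reproduces the 210 partitionings given in the text. The first comes out one below the "some 60,460" quoted there; the difference may simply reflect a counting convention.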

Theorem 2 is concerned with the possibility of finding splits of
the set S as it is being formed.

Theorem 2. Consider the possibility of a set of documents
being split by the addition of another document. Three statements
can be made.

1. If the new document is positively correlated to each item 
in the set, then no split can be created. 

2. If a split is created, it must be crossed by at least 
one newly added negative link. 

3. The sum of the newly added links crossing any split 
created must be negative. 

The next two theorems will help to determine whether the set S is a 
subset cluster when it contains one or more documents that are positively 
correlated to all of the other documents in S. 

Theorem 3. If a set of n documents has d or more documents
that are positively linked to every other document in the set,
then the set has no splits.

$$d = \frac{n\,|K|}{C_{min} + |K|}$$

Theorem 4. Assume that a set of documents has splits. Now
remove all those documents that are positively correlated to
every other document in the set. The reduced set must also
have splits.

The sum of the links connecting documents in the subset A to docu- 
ments in B is termed the cross correlation of the partitioning which 
created A and B. The following three theorems relate to this cross 
correlation. 

Theorem 5. The cross correlations of all possible partitionings
of a document set are equal if and only if every link
has the value 0. (n>3)

Theorem 6. The cross correlations of all possible partitionings
of a document set of size a:b are equal if and only
if every link has the same value.

Theorem 7. The average cross correlation of the partitionings
of size a:b is $C(S)(a)(b)/\binom{n}{2}$ where $C(S)$ is the total
internal correlation of the set.
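
Theorem 7 can be checked numerically on a small network. The sketch below is illustrative: the five-document network and its link values are arbitrary, generated by a formula chosen only to give a mix of positive, zero and negative links.

```python
from itertools import combinations
from math import comb, isclose


def average_cross_correlation(docs, links, a):
    """Average, over all subsets A of size a, of the links crossing A and its complement."""
    totals = []
    for part_a in combinations(docs, a):
        part_b = [d for d in docs if d not in part_a]
        totals.append(sum(links[frozenset((x, y))]
                          for x in part_a for y in part_b))
    return sum(totals) / len(totals)


docs = [1, 2, 3, 4, 5]
# Arbitrary illustrative link values in the range -3..3.
links = {frozenset((i, j)): (i * j) % 7 - 3 for i, j in combinations(docs, 2)}
c_s = sum(links.values())  # total internal correlation C(S)
for a in (1, 2):
    b = len(docs) - a
    predicted = c_s * a * b / comb(len(docs), 2)
    print(a, b, average_cross_correlation(docs, links, a), predicted)
```

For each size the enumerated average and the value predicted by the theorem should agree. The result follows because each link crosses a randomly chosen a:b partitioning with probability $ab/\binom{n}{2}$.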

4.32 Forming Kernels

Another method which was considered as a way for determining if a 
set was a subset cluster was to form highly correlated kernels within 
the set in question and thereby try to locate possible splits. The ker- 
nels might initially be those subsets wherein every document is posi- 
tively correlated to every other document. These sets could then be 
combined in various ways to see if any splits appeared. The following 
two theorems relate to this approach. 

The symbols used are as defined in the last section and as follows:

$C_{avg}$ - average of the positive links of the set.

$D_i$ - the i-th disjoint kernel of the set S.

$$D_1 \cup \ldots \cup D_t \subset S$$

$$D_i \cap D_j = \text{null set for all } i,j\ (i \neq j).$$

Theorem. If the sum of the internal correlations of a set
of disjoint kernels is greater than or equal to the total
internal correlation of the set, then there is at least one
split in the set.

In other words, if $\sum_{i=1}^{t} C(D_i) \geq C(S)$, then S has at least 1 split.

Theorem. A sufficient condition for having at least one
split in a set is that the set contain at least d negative
links where:

$$d = \frac{\binom{n}{2}\, C_{avg} - \sum_{i=1}^{t} C(D_i)}{C_{avg} + |K|}$$

4.4 Biased Clusters

In this section an extension or modification to the cluster defini- 
tions is proposed. It was initially devised in order that subset 
clusters could have a hierarchal structure. It was found to be a useful 
modification to local maximum clusters also. 

As a way of introducing the concept of a biased cluster, let us con- 
sider a large cluster (either local maximum or subset) of documents 
covering a rather broad field of interest. There will, of course, be 
users who want all of the documents in such a cluster, but what about 
the users whose interests are very specific and who want only a small 
portion of the cluster? As yet there has been no provision for such a 
narrowing of interest. Subset clusters and many local maximum clusters 
are not decomposable. We shall now present the theoretical basis of a 
method which will allow a cluster to be reduced to a more specific set 
or enlarged to a more general set. 

Consider a set of documents, $W = (w_1, \ldots, w_r)$, which forms a cluster
in the overall document network. The problem of retrieving a portion of
this cluster is regarded as equivalent to the problem of finding a
cluster in the sub-library consisting only of W.

In order to show how this might be done let us define a new sample
space which has only $2^r$ points instead of the $2^n$ points of the original
sample space. Each point in the new space represents a possible
partitioning of W. To distinguish between the probabilities of the two
sample spaces, the probabilities of the old sample space will be given
a subscript '$\alpha$' and the probabilities of the new sample space a
subscript '$\beta$'. Let the probabilities assigned to the points of this new
sample space be initially equal to the marginal probabilities of the
corresponding events over the old sample space.

$$p_\beta(w_1 \ldots w_r) = p_\alpha(w_1 \ldots w_r) = \sum_{\text{over all x not in W}} p_\alpha(x_1 \ldots x_n)$$

The marginal probability, $p_\alpha(w_1^0 \ldots w_r^0)$, is the sum of the
probabilities of all those elementary events in which none of the documents in W
are in the subset of interest. Since these events are irrelevant when
one is considering only the sub-library W, let us set $p_\beta(w_1^0 \ldots w_r^0)$ equal
to 0. Such a step requires that the other $p_\beta(w_1 \ldots w_r)$'s all be increased
by a normalizing factor k. The final values for the probabilities
assigned to the new sample space can now be specified.

$$p_\beta(w_1 \ldots w_r) = k\, p_\alpha(w_1 \ldots w_r) \quad \text{for all } p_\beta(w_1 \ldots w_r) \text{ except } p_\beta(w_1^0 \ldots w_r^0)$$

$$k = 1/[1 - p_\alpha(w_1^0 \ldots w_r^0)]$$

Now let us consider the effect of this change in the sample space
on the correlation of any two documents in W.

$$C_\alpha(w_1^1 w_2^1) = \log \frac{p_\alpha(w_1^1 w_2^1)}{p_\alpha(w_1^1)\, p_\alpha(w_2^1)}$$

$$C_\beta(w_1^1 w_2^1) = \log \frac{p_\beta(w_1^1 w_2^1)}{p_\beta(w_1^1)\, p_\beta(w_2^1)}
= \log \frac{(k)\, p_\alpha(w_1^1 w_2^1)}{(k)\, p_\alpha(w_1^1)\,(k)\, p_\alpha(w_2^1)}
= \log \frac{p_\alpha(w_1^1 w_2^1)}{p_\alpha(w_1^1)\, p_\alpha(w_2^1)} - \log (k)$$

$$C_\beta(w_1^1 w_2^1) = C_\alpha(w_1^1 w_2^1) - \log (k)$$

Thus the correlations for the sub-library can be obtained by merely
subtracting a constant or bias from the correlations for the full library.

An alternative way to describe this approach is through the frequency
counts used in making the probability estimates. Instead of considering
all the available partitionings of the document file, let us consider
only those partitionings in which one or more of the documents in W occur
in the subset of interest. Let us denote the counts based on this
restricted set of partitionings by the letter M and use N for the original
counts.

$$N_i = M_i \quad \text{for all i in W.}$$

$$N_{ij} = M_{ij} \quad \text{for all i,j in W.}$$

Now let us consider what happens to the approximation to C based on
the probability estimates with the new frequency counts.

$$C'_\beta(w_i^1 w_j^1) = \log \frac{M\, M_{ij}}{M_i\, M_j} = \log \frac{M\, N_{ij}}{N_i\, N_j} = \log \frac{N\, N_{ij}}{N_i\, N_j} - \log \frac{N}{M}$$

$$C'_\beta(w_i^1 w_j^1) = C'_\alpha(w_i^1 w_j^1) - \log (N/M)$$

Here again we note that we can in effect reduce the size of the
library under consideration by merely subtracting a constant from each
correlation value.

In an analogous manner we can increase the size of the library and
thereby obtain larger, more general clusters by adding some bias to each
correlation in the network.
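
A small numerical check of the bias relation follows. It is a sketch under stated assumptions: the estimate is taken to be the base-2 logarithm of (total count) x (joint count) / (product of the single counts), and all counts are hypothetical round numbers chosen so the logarithms come out exact.

```python
import math


def estimated_correlation(total, n_i, n_j, n_ij):
    """Estimated measure from frequency counts (all counts assumed positive)."""
    return math.log2(total * n_ij / (n_i * n_j))


N, M = 1024, 256            # hypothetical: all partitionings vs. those touching W
n_i, n_j, n_ij = 16, 8, 4   # hypothetical counts, identical in both settings
c_alpha = estimated_correlation(N, n_i, n_j, n_ij)
c_beta = estimated_correlation(M, n_i, n_j, n_ij)
bias = math.log2(N / M)
print(c_alpha, c_beta, bias)  # 5.0 3.0 2.0
```

Restricting attention to the 256 partitionings that involve W lowers this link from 5 to 3, that is, by the constant log (N/M) = 2, and the same shift applies to every other link within W.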

We now observe that of the three measures which meet the criteria
outlined in Sec. 3.2 (3, 4, and 8) only Measure 8 allows this type of
narrowing and broadening of the request range. Measures 3 and 4 are
insensitive to any change in the size of the library or partitioning file.

One final question arises concerning the biasing of the value K
assigned to links for which $N_{ij} = 0$. One could either let the bias affect
all links equally or one could look upon K as a fixed value which is not
changed by the bias. The latter approach was rather arbitrarily
selected.

We are now ready to define what is meant by a biased cluster. 

Definition: Biased Cluster 

A biased local maximum cluster has the same definition as
a regular local maximum cluster, but a non-zero bias has been
applied to the document network in which the cluster is formed.
The same is true of a biased subset cluster.

In summary, a simple, easy-to-use method has been suggested which 
will allow the size of clusters to be increased or decreased. Some 
arguments have been presented which show that the method has a sound 
theoretical basis. 

4.5 Final Cluster Decision

The local maximum definition of clusters was reconsidered after no
general method for finding subset clusters was found. It was pointed
out in Sec. 4.1 that local maximum clusters were considered unacceptable
because totally unrelated subsets of documents could be part of the
same cluster. The following theorem and lemmas show that this difficulty
can be avoided by selecting an appropriate value for K.

During the remainder of this section it will be assumed that all of
the links for which $N_{ij} \neq 0$ are positive (See Sec. 3.5). If this
condition does not hold then the theorems and lemmas which follow can be
restated in terms of links for which $N_{ij} \neq 0$ and links for which $N_{ij} = 0$
instead of positive and negative links.

Theorem . Each document in a local maximum cluster of n 

documents is positively linked to over half of the remaining 

n-1 documents if K<^-C 

— max 

Proof . By definition each document in a local maximum cluster is 
positively correlated to the remaining (n-l) documents in the cluster. 
Now if the positive links are smaller or equal in magnitude than the 
negative links, then it stands to reason that there must be more of the 
former to yield a positive sum. 

Lemma. Consider a local maximum cluster that is partitioned
into 2 subsets, X_α and X_β, with X_β the larger if they
differ in size. If K ≤ -C_max, every document in X_α has at
least one positive link to the other subset.

Lemma. In a local maximum cluster with K ≤ -C_max there
can be no subset that is totally uncorrelated (has no positive
links) to the remainder of the cluster.

The choice of K ≤ -C_max does not insure that a local maximum cluster
will be free of splits and thus be a subset cluster. Subsets can still
be negatively correlated to the remainder of the cluster. But it does
insure that the rather strong type of relatedness expressed by the above
two lemmas will exist for each partitioning of a local maximum cluster.

Another advantage to choosing K ≤ -C_max is that it provides the
system with a very simple test of whether two documents can be in the
same local maximum cluster.

Theorem. If K ≤ -C_max then two negatively linked documents
can occur in a local maximum cluster together only if they are
positively linked to at least one common document.

Proof. Consider a local maximum cluster of n documents. Assume
that there are two negatively correlated documents, x_α and x_β, in the
cluster. By the previous theorem x_α must be positively correlated to
over half of the (n-1) other documents in the cluster. Since x_α is not
positively correlated to x_β it must be positively correlated to more
than half of the remaining (n-2) documents. This is true of x_β also.
Thus they must be positively correlated to at least one common document.
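
The test suggested by this theorem can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the callable `link`, the function name `may_share_cluster`, and the four-document network are hypothetical and do not come from the experimental system. The test is a necessary condition only; passing it does not guarantee that a common cluster exists.

```python
def may_share_cluster(a, b, docs, link):
    """Necessary (not sufficient) test taken from the theorem above.

    `link(p, q)` is a hypothetical callable returning the signed link
    between two documents; K <= -C_max is assumed to hold already.
    """
    if link(a, b) > 0:
        return True  # the theorem only restricts negatively linked pairs
    # A negatively linked pair needs at least one common positive neighbour.
    return any(link(a, d) > 0 and link(b, d) > 0
               for d in docs if d not in (a, b))


# Hypothetical network: listed pairs are +4, every other pair is K = -5.
POSITIVE = {frozenset(p) for p in [("a", "c"), ("b", "c"), ("a", "d")]}


def demo_link(p, q):
    return 4.0 if frozenset((p, q)) in POSITIVE else -5.0


DOCS = ["a", "b", "c", "d"]
print(may_share_cluster("a", "b", DOCS, demo_link))  # a and b share c
print(may_share_cluster("b", "d", DOCS, demo_link))  # no common positive link
```

In this toy network a and b are negatively linked but both are positively linked to c, so they pass the test, while b and d have no common positively linked document and can be ruled out immediately.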

Next let us consider what value should be assigned to K to insure
that K ≤ -C_max. In Sec. 3.5 it was shown that the largest value that the
estimated correlation can possibly take is (log N) where N is the number
of available partitionings of the document file. Thus if we make K equal
to (-log N) we will be assured that K ≤ -C_max.

So far some reasons have been given indicating that it might be 
expedient from a practical standpoint to make K equal to (-log N). Let 
us now consider whether this value for K is justifiable theoretically. 

It was noted in Sec. 3.5 that if the frequency counts are based on
a finite number (N) of partitionings, then none of the probability
estimates can fall between 0 and 1/N. This results in those correlations
which might have been in the range -∞ to (2 - log N) being estimated to
be -∞ (or perhaps some value greater than (2 - log N)). It was suggested
that those correlation estimates that are -∞ by the formula might be
more appropriately adjusted to some finite negative value, K, since a
correlation of -∞ implies that there is absolutely no chance of the two
documents ever occurring together.

Thus K can be considered an approximation to the correlations in the
range -∞ to (2 - log N) and it would seem appropriate that it assume some
value within that range. Consider also what value K should assume as N
approaches ∞. It is suggested that K should approach -∞ as N
approaches ∞ since those document pairs for which N_ij still equals 0 in
the limit do in fact never occur together and C(x_i x_j) should be -∞.

There are two other consequences to making K = -log N that should be
noted. It gives the correlation a symmetric range about 0 (-log N to
log N). It also forces the correlation of documents that have never
occurred together to always be less than the correlation of documents
that have co-occurred [(-log N) < (2 - log N)].
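
A small numerical check of these two consequences is sketched below. The base of the logarithm is not restated in this section, so base 2 is assumed here purely for illustration, and N = 1024 is an arbitrary example value.

```python
import math


def negative_link_value(n_partitionings, log=math.log2):
    """K = -log N. The base-2 logarithm is an assumption of this sketch."""
    return -log(n_partitionings)


N = 1024                              # arbitrary example file size
c_max = math.log2(N)                  # largest possible estimated correlation
K = negative_link_value(N)            # value given to never-co-occurring pairs
lowest_cooccurring = 2 - math.log2(N)  # the bound (2 - log N) quoted above

print(K, c_max, lowest_cooccurring)
```

With these values the range is symmetric (K = -C_max) and K lies below the bound (2 - log N), as the text requires.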

The local maximum definition is therefore selected for use in this
project. Its definition is extended to include biased clusters and it
is required that K = -log N. Hereafter we will refer to a local maximum
cluster as just a cluster.

CHAPTER V
SEARCH PROCEDURE

The last component of the theoretical model is the procedure which
transforms a request for information into the set of documents that
comprise the answer. The first step in describing the procedure will be to
make a number of definitions. Then a list of features that a suitable
procedure should have will be given. Finally the particular procedure
developed for this project will be described and analyzed.

5.1 Definitions 

Definition: Request

A request for information from the system is defined to consist
of two subsets of documents. One subset, Y=(y_1,...,y_m),
contains those papers known by the user to be pertinent to the
current search. The other, Z=(z_1,...,z_n), contains those papers
that are known to be not pertinent. The Y subset must be non-empty
but the Z subset can be empty.

Definition: Answer

An answer to a request is defined to be a cluster of
documents which includes the Y subset of the request and
excludes the Z subset.

Definition: Clustering Procedure

Any algorithm which transforms a request into an answer
will be termed a clustering procedure (sometimes hereafter just
called a procedure). We will consider for this project only
clustering procedures which are iterative in nature and which
on each iteration change the contents of a certain set of documents,
S=(s_1,...,s_k). Upon termination of the procedure S is
to be the answer set. For most of the procedures considered
here only a single change is made to S on each iteration. The
S generated by the i-th iteration can be distinguished by a
subscript (S_i).

Definition: Convergent Procedure

A convergent procedure is one that terminates after a
finite number of iterations.

Definition: Inconsistent Request

A request is said to be inconsistent if there is no answer
cluster for any bias which satisfies the request.

Definition: Ambiguous Request

A request is said to be ambiguous if there is more than
one answer cluster which satisfies the request. Note that one
must consider all possible biases in determining ambiguity.

Requests with empty Z sets will generally be ambiguous. This is
because larger and larger answer clusters can be formed by increasing
the bias. For example, the request of Fig. 5.1 is ambiguous, having the
following four possible answers.

Answer               Bias

(y_1)                -∞ → -4
(y_1 x_1)            -4 → -3
(y_1 x_1 x_2)        -3 → +7
(y_1 x_1 x_2 x_3)    +7 → +∞

[Network diagram not reproduced. Links not shown are -5. Y=(y_1), Z=( ).]

Fig. 5.1. Ambiguous Request.
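
The definitions of this section can be made concrete with a brute-force sketch. It is not the procedure of this thesis, and it makes several assumptions that should be read as such: the correlation of a document to a set is taken to be the sum of its links to the members, the bias is added to every link, a set counts as a cluster when no member is negative to the rest and no outsider is positive to the set (the two conditions of Fig. 5.2), and the four-document network is hypothetical because the network of Fig. 5.1 is not recoverable.

```python
from itertools import combinations

DEFAULT_LINK = -5.0  # hypothetical value for pairs not listed below
LINKS = {
    frozenset(("y1", "x1")): 4.0,
    frozenset(("y1", "x2")): 3.0,
    frozenset(("x1", "x2")): 3.0,
}
DOCS = {"y1", "x1", "x2", "x3"}


def link(a, b, bias):
    """Link between two documents with the bias applied (assumed additive)."""
    return LINKS.get(frozenset((a, b)), DEFAULT_LINK) + bias


def corr_to_set(doc, members, bias):
    """Summed biased links from doc to every other member of the set."""
    return sum(link(doc, m, bias) for m in members if m != doc)


def is_cluster(S, docs, bias):
    inside_ok = all(corr_to_set(d, S, bias) >= 0 for d in S)
    outside_ok = all(corr_to_set(d, S, bias) <= 0 for d in docs - S)
    return inside_ok and outside_ok


def answers(Y, Z, docs, bias):
    """All clusters at this bias that include Y and exclude Z."""
    free = sorted(docs - Y - Z)
    found = []
    for r in range(len(free) + 1):
        for extra in combinations(free, r):
            S = Y | set(extra)
            if is_cluster(S, docs, bias):
                found.append(S)
    return found


for bias in (-4.5, -3.5, 0.0, 6.0):
    print(bias, [sorted(s) for s in answers({"y1"}, set(), DOCS, bias)])
```

Each of the four bias values yields a different and progressively larger answer for the same request Y=(y1), Z=( ), which is exactly what makes such a request ambiguous. The enumeration over all subsets is exponential and is used here only because the example is tiny.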

5.2 Attributes of a Good Clustering Procedure 

In this section we shall list some characteristics which the
clustering procedure should have. It will be assumed that the definition
of a cluster of documents as given in Chapter 4 is suitable. If this is
the case, then the basic objective of a clustering procedure would be to
locate the appropriate cluster in an efficient way.

1. Request Satisfaction 

If the request is unambiguous and consistent, then the procedure 
should produce the one cluster which satisfies the request. 

2. Request Modification 

If the request is ambiguous or inconsistent, then the procedure should 
be able to recognize this fact and should help the user to modify his 
request. This suggests that the procedure should allow close man- 
machine coupling so that information generated by the clustering process 
can be presented to the user for his examination and modifications to the 
request can be fed back into the system. 

3. Convergence

The procedure should be convergent for every possible request and
document network. Whether it is forming an answer cluster or determining
request ambiguity or inconsistency, it should never fall into a repetitive,
non-terminating cycle.

4. Minimal Number of Iterations

The procedure should find the answer in as few iterations as
possible. An excessively large number of deletions of previously added
documents from the set being formed would be undesirable.

5.3 Description of Procedure 

A description and flow chart of the procedure developed for this
project will be presented in this section. An analysis of the procedure
will be given in Sec. 5.5.

Fig. 5.2 is a block diagram showing the overall structure of the
procedure. Before attempting to describe each block in Fig. 5.2 in
detail let us make some general comments about the procedure.

There are three basic phases which the procedure can enter depending 
on the amount of bias required and the relationships of various documents 
and sets of documents. 
Phase I: No Bias

The procedure starts in this phase, remains in it as long as no bias
is required, and returns to it from Phase II if at some point the bias
can be reduced to zero. The documents considered for addition to S in
this phase are those (positive to S) which keep each y_j in Y positive to
S (or at least increase its correlation to S) and keep each z_j in Z
negative to S (or at least decrease its correlation to S). Of these
candidates the one with the highest correlation to S is selected for
addition to S. If at some point there are no more documents that are
positive to S, then the procedure terminates. If there are documents
that are positive to S but none of them meet the above conditions with
respect to Y and Z, then it is concluded that some bias will be needed
and Phase II is entered.

[Flow chart not reproduced. Recoverable block labels: Initialization;
Condition 1 of Cluster Definition: "Are there documents in S that are
negative to S? (Y's excluded)"; Delete a document from S; Condition 2 of
Cluster Definition: "Are there documents not in S that are positive to S?
(Z's excluded)"; "Is Y included in S and Z excluded from S?"; "Are there
request documents in trouble by the above test which are in both Y and
Z?"; Mark request as inconsistent; exit.]

Fig. 5.2. Overall Flow Chart.

Phase II: Bias

In Phase II the bias is either made positive enough to keep all the
y_j's positive to S or made negative enough to keep all the z_j's negative
to S. On each iteration those documents that are positive to S by the
current bias are considered for addition to S. Of these candidates the
document which requires the least bias when added to S is selected for
addition to S. If at any time the bias becomes zero the procedure
returns to Phase I.

When there are no more documents that are positive to S, the procedure
either terminates or enters Phase III. Actually certain constraints
are placed on the amount the bias can change on any one iteration. This
means that all of the request documents may not be properly correlated to
S (y_j's positive to S and z_j's negative to S) at the end of Phase II.
If they are all properly correlated to S (i.e. the request is satisfied),
the procedure terminates. If they are not yet properly correlated to S,
the procedure enters Phase III.

Phase III: Monotonic Bias

The purpose of this phase is to either make positive to S certain y_j
that are not currently positive to S or to make negative to S certain z_j
that are not currently negative to S. This is accomplished by allowing the
bias to move in only one direction while suitable additions and/or
deletions are made to S. One may not return to Phase I or II from Phase
III. Phase III and the procedure terminate when the y_j's and z_j's are
correctly linked to S.

The detailed flow charts for the general blocks of Fig. 5.2 will be
greatly simplified if we first define a number of symbols.

Flow Chart Symbol Definitions

∅: The null set.
∩: Set intersection operator.
∪: Set union operator.
S̄: Set of all documents not in set S. (Complement)
⊂: Set inclusion: A⊂B means set A is included in set B.
Y: The set of all documents specified as interesting by the user.
Z: The set of all documents specified as not interesting by the user.
S: The set which is being formed into the answer cluster by the
      procedure. (Y⊂S)
P: The set of all documents positively correlated to the set S by the
      current bias. A document in S is in P if it is positively
      correlated to the remainder of S.
Q: The set of documents included in P but not in S or Z. The document
      to be added to S will be chosen from this set. Q = P∩S̄∩Z̄
T: The set consisting of those documents in Q which will not require
      positive bias if added to S. Document t_i is in T if when it
      is added to S it will do one or both of the following operations
      for every document y_j in Y.
      (1) Keep y_j positive to the new S.        C[y_j (S∪t_i)] > 0  (with bias)
      (2) Increase the correlation of y_j to S.  C(y_j t_i) > 0      (with bias)
V: The set consisting of those documents in Q which will not require a
      negative bias if added to S. Document v_i is in V if when it
      is added to S it will do one or both of the following operations
      for every document z_j in Z.
      (1) Keep z_j negative to the new S.        C[z_j (S∪v_i)] < 0  (with bias)
      (2) Decrease the correlation of z_j to S.  C(z_j v_i) < 0      (with bias)
X: The set of documents which are candidates for addition to S. If
      there are one or more documents in Q that require no bias if
      added to S, then X contains those documents. Otherwise it
      contains the documents that require a change in bias in only
      one direction.
W: The set of documents which are candidates for deletion from S. A
      document w_i is in W if it is negatively correlated to the
      remainder of S by the current bias and if it is not included
      in Y.
      C[w_i (S∩w̄_i)] < 0,  w_i ⊂ S∩Ȳ
f: Number of positive links in the set S. (with no bias)
g_i: Number of positive links from document x_i to S. (with no bias)
d_i: Bias required for the set (S∪x_i). If x_i ⊂ T∩V̄ then d_i is just
      negative enough to keep each z_j negative to (S∪x_i). If
      x_i ⊂ V∩T̄ then d_i is just positive enough to keep each y_j
      positive to (S∪x_i). If X = T∩V then d_i is made 0.
BIAS: Current bias.
b_i: Allowable change in bias if x_i is added to S.
      b_i = minimum [ (d_i - BIAS), 1, 10/(f + g_i), C(x_i S)/(f + g_i) ]
      (C above is by current bias.)
R: The set of documents in X that would keep the bias at 0 or allow it
      to be reduced to 0 if added to S.
      (BIAS + b_j) = 0 for all x_j ⊂ R
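
As an aid to reading these definitions, the sets P, Q, T, V, X and W can be computed directly for a small example. The sketch below is one reading of the definitions, not the program used in the experimental system. It assumes that the correlation of a document to a set is the sum of its links to the members and that the bias is added to every link; the callable `link`, the function names, and the five-document network are hypothetical.

```python
def corr_to_set(doc, members, link, bias):
    """Summed biased links from doc to every other member of the set."""
    return sum(link(doc, m) + bias for m in members if m != doc)


def candidate_sets(docs, S, Y, Z, link, bias=0.0):
    # P: documents positively correlated to S by the current bias.
    P = {d for d in docs if corr_to_set(d, S, link, bias) > 0}
    # Q: members of P that are in neither S nor Z.
    Q = P - S - Z
    # T: additions that need no positive bias (conditions 1 or 2 for every y).
    T = {t for t in Q
         if all(corr_to_set(y, S | {t}, link, bias) > 0 or link(y, t) + bias > 0
                for y in Y)}
    # V: additions that need no negative bias (conditions 1 or 2 for every z).
    V = {v for v in Q
         if all(corr_to_set(z, S | {v}, link, bias) < 0 or link(z, v) + bias < 0
                for z in Z)}
    # X: no-bias candidates if any exist, otherwise one-direction candidates.
    X = (T & V) or ((T - V) | (V - T))
    # W: deletion candidates, i.e. non-Y members negative to the rest of S.
    W = {w for w in S - Y if corr_to_set(w, S, link, bias) < 0}
    return {"P": P, "Q": Q, "T": T, "V": V, "X": X, "W": W}


# Hypothetical network: listed pairs as given, every other pair is -5.
LINKS = {frozenset(("y1", "a")): 4.0,
         frozenset(("y1", "b")): 4.0,
         frozenset(("z1", "b")): 6.0}


def demo_link(p, q):
    return LINKS.get(frozenset((p, q)), -5.0)


result = candidate_sets({"y1", "z1", "a", "b", "c"},
                        S={"y1"}, Y={"y1"}, Z={"z1"}, link=demo_link)
print(result)
```

Here both a and b are positive to S, but adding b would pull z1 toward S, so b is in T but not in V, and only a is left as a no-bias candidate in X.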

We are now ready to present more detailed flow charts for the
blocks of Fig. 5.2. Fig. 5.3 covers block 1, Fig. 5.4 covers blocks 2
and 3, Fig. 5.5 covers blocks 4 and 5, and Fig. 5.6 covers blocks 6-9.
A brief comment is made to the right of each step in these detailed flow
charts as an aid to understanding them. More precise statements of
their functions are given in Sec. 5.5.

5.4 Earlier Procedures

For historical purposes and for comparison and analysis, let us
briefly document some of the earlier procedures which were considered.

Procedure 1

Briefly this procedure transforms a request into three subsets:

A: the set of documents related to the request.
B: the set of some of the documents not related to the request.
C: a "limbo" set of documents positively correlated to both
   sets A and B.

Initially set A contains only those documents specified as
interesting by the user, and set B contains those documents specified
as non-interesting. On each iteration all documents positively
(negatively) linked to A(B) and negatively (positively) linked to
B(A) are added to A(B). Documents positively linked to both A and
B are placed in limbo while those negatively linked to both are
ignored. All changes to the sets A, B, and C are made concurrently
at the end of each iteration.
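
One iteration of Procedure 1 can be sketched as follows. The sketch assumes that "linked to A" means the summed link strength to the members of A, that both A and B are non-empty, and that the limbo set is recomputed on every iteration; the callable `link` and the example network are hypothetical.

```python
def corr_to_set(doc, members, link):
    """Summed links from doc to the members of a set (assumed measure)."""
    return sum(link(doc, m) for m in members if m != doc)


def procedure1_iteration(docs, A, B, link):
    """Return the new (A, B, limbo) after one concurrent update."""
    add_a, add_b, limbo = set(), set(), set()
    for d in docs - A - B:
        to_a = corr_to_set(d, A, link)
        to_b = corr_to_set(d, B, link)
        if to_a > 0 and to_b < 0:
            add_a.add(d)          # related to the request only
        elif to_b > 0 and to_a < 0:
            add_b.add(d)          # related to the non-pertinent set only
        elif to_a > 0 and to_b > 0:
            limbo.add(d)          # positive to both: placed in limbo
        # documents negative to both sets are ignored
    # All changes are applied together at the end of the iteration.
    return A | add_a, B | add_b, limbo


# Hypothetical network: listed pairs are +5, every other pair is -6.
POSITIVE = {frozenset(p) for p in
            [("y1", "a"), ("z1", "b"), ("y1", "c"), ("z1", "c")]}


def demo_link(p, q):
    return 5.0 if frozenset((p, q)) in POSITIVE else -6.0


A1, B1, C1 = procedure1_iteration({"y1", "z1", "a", "b", "c", "d"},
                                  {"y1"}, {"z1"}, demo_link)
print(sorted(A1), sorted(B1), sorted(C1))
```

In the example a joins A, b joins B, c goes to limbo because it is positive to both sets, and d is ignored.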

[Flow chart not reproduced; the step comments and the recoverable box
contents follow.]

1. Allow user to specify initial Y and Z sets. (Interaction point)
2. Put the interesting documents in S. (S_0 = Y)
3. Indicate that the procedure is not yet in the third phase.
   (PHASE III = NOT YET)
4. Start with an initial bias of 0. (BIAS = 0)

Fig. 5.3. Initialization

[Flow chart not reproduced; the step comments follow.]

5. Check if there are documents in S that are negative to the remainder
   of S.
6. Point at which information can flow between the user and the system.
   (e.g. status of clustering procedure, data on particular documents,
   modifications to the request, etc.)
7. Delete a document from S.

Fig. 5.4. Condition 1 and Deletions.

[Flow chart not reproduced; the step comments and the recoverable box
contents follow.]

8. Check if there are any more documents positive to S.
9. Check if there are documents positive to S that keep (or try to keep)
   all the y's positive and all the z's negative.
10. Check if there are documents which require a change in bias in only
    one direction. Note that T∪V = (T∩V̄)∪(V∩T̄) at this point.
11. Load the set X with the candidates for addition to S.
12. Check if one or more documents in X can allow the bias to drop to zero.
13. Point at which information can flow between the user and the system.
    (e.g. status of clustering procedure, data on particular documents,
    modifications to the request, etc.)
14. Add a document to S. The document x_k is the x_i in R for which
    C(x_i S) is a maximum. (Based on current bias.)
       S = S∪x_k, where C(x_k S) ≥ C(x_i S) for all x_i ⊂ R.
15. Add a document to S. The document x_k is the x_i in X for which the
    magnitude of the allowable new bias, |BIAS + b_i|, is a minimum.
       S = S∪x_k, where |BIAS + b_k| ≤ |BIAS + b_i| for all x_i ⊂ X.
16. Change the bias if necessary. (Sign of b_k is modified by PHASE III to
    allow change in one direction only.)
       BIAS = (BIAS + b_k), where b_k is for the x_k added to S.

Fig. 5.5. Condition 2 and Additions.
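
Steps 12 and 14-16 can be illustrated with a short sketch. It follows the formula for b_i literally as given in the symbol definitions and should be read as one interpretation of the flow chart, not as the program used in the experimental system. The function names, the dictionaries holding d_i, g_i and C(x_i S), the assumption that f + g_i > 0, and all numerical values are hypothetical.

```python
def allowable_change(d_i, bias, f, g_i, c_i):
    """b_i = minimum[(d_i - BIAS), 1, 10/(f + g_i), C(x_i S)/(f + g_i)]."""
    n_links = f + g_i  # assumed to be greater than zero
    return min(d_i - bias, 1.0, 10.0 / n_links, c_i / n_links)


def choose_addition(X, bias, d, f, g, c, tol=1e-9):
    """Pick the document to add and return it with the new bias."""
    b = {x: allowable_change(d[x], bias, f, g[x], c[x]) for x in X}
    # R: candidates that keep the bias at 0 or let it return to 0 (Step 12).
    R = sorted(x for x in X if abs(bias + b[x]) <= tol)
    if R:
        pick = max(R, key=lambda x: c[x])                  # Step 14
    else:
        pick = min(sorted(X), key=lambda x: abs(bias + b[x]))  # Step 15
    return pick, bias + b[pick]                            # Step 16


# Case 1: both candidates need some positive bias; the smaller one wins.
pick1, bias1 = choose_addition({"a", "b"}, 0.0,
                               d={"a": 3.0, "b": 0.5}, f=2,
                               g={"a": 1, "b": 1}, c={"a": 6.0, "b": 1.2})
# Case 2: neither candidate needs bias; the highest correlation wins.
pick2, bias2 = choose_addition({"a", "b"}, 0.0,
                               d={"a": 0.0, "b": 0.0}, f=2,
                               g={"a": 1, "b": 1}, c={"a": 6.0, "b": 1.2})
print(pick1, bias1, pick2, bias2)
```

In the first case the set R is empty and the document needing the smallest new bias is chosen; in the second case R holds both candidates and the one with the higher correlation to S is chosen, leaving the bias at zero.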

[Flow chart not reproduced; the step comments and the recoverable box
contents follow.]

Tests for Request Documents in Trouble

17. Check if all the documents in Y are positive to S.
18. Check if all the documents in Z are negative to S.
19. Termination of procedure. The answer cluster is S.

Phase III Bias Change

20. Check if this is the first time through Phase III.
21. Set PHASE III switch to allow bias to change in only one direction.
    (PHASE III = DECREASE BIAS ONLY, or PHASE III = INCREASE BIAS ONLY)
22. Make maximum change in bias. (The sign depends on the Phase III
    switch.)
       BIAS = BIAS + Minimum (1, 10/f)

Inconsistent Request

23. The request is considered inconsistent since the bias must go up and
    down simultaneously. The user is informed of this fact and allowed to
    ask questions and/or modify the request. (Interaction point)
24. A document is chosen for deletion from Z if the user has not already
    modified the request.

Fig. 5.6. Phase III and other tests.

Procedure 2

This procedure is the same as Procedure 1 except that only one
change is made to set A or set B at a time. Thus, the most positively
correlated document is added and then the most negative document
is deleted from each set.

Procedure 3

The basic difference between this procedure and Procedure 2 is
that the criterion used to determine which document to add to set A
or B is that it be most positively related to the original request
instead of the current trial subset (S). Only those documents that
are positively correlated to S are considered for addition. Within
this set, selection is on the basis of correlation to the original
request.

Procedure 4

This procedure attempts to combine the advantages of Procedures
1 and 2. All documents positively correlated to either sets A or B
(but not both) should be added to them on the first iteration as in
Procedure 1. Subsequently only single changes are made to the subsets
as in Procedure 2.

Let us briefly note here why these earlier procedures were rejected.
All of these procedures have a single subset B into which the documents
considered not pertinent to the search are placed. This subset is
treated just like the subset of pertinent documents and an attempt is
made to form it into a cluster also.

The difficulty with such an approach can be seen by the example of
Fig. 5.7. By the above procedures the non-pertinent set B is initialized
with Z=(z_1 z_2). Further additions to B are not possible because x_1
and x_2 are both negative to B. This is because the non-pertinent set is
really not one cluster but two clusters. Since x_1 and x_2 are negative
to B, one of them can be added to A. This will make x_3 and x_4 negative
to A and divert the procedure from the desired cluster. Basically what
has happened is that the usefulness of the documents in Z has been
hindered by requiring that they form a single cluster.



[Network diagram not reproduced. Links shown are +5; links not shown
are -6. Y=(y_1), Z=(z_1 z_2).]

Fig. 5.7. Example showing why non-pertinent documents
should not all be grouped into one cluster.



This would lead one to suggest that perhaps a separate cluster 
should be formed around each document in Z. There are some reasons why 
this would not prove useful in addition to the fact that it would eat up 
an excessive amount of effort in the formation of non-pertinent clusters. 
Consider the example of Fig. 5.8. Let us assume that x_1 is added to A
and x_5 to B on the first iteration. Now on the second iteration x_4 can
be added to A because it is no longer positive to B. The cluster
(x_1 x_2 y_1) is again not found because the non-pertinent cluster formed
around z_1 was one containing x_5 instead of (y_1 x_3 x_4 z_1). The point
here is that the z_i's will be in a number of clusters and one does not
know exactly which cluster to form around z_i in order to divert S in
another direction.

[Network diagram not reproduced. Links shown are +5; links not shown
are -6. Y=(y_1), Z=(z_1). Desired cluster: (y_1 x_1 x_2). Cluster to be
excluded by z_1: (y_1 x_3 x_4 z_1).]

Fig. 5.8. Example of difficulty with forming clusters
around non-pertinent documents.



5.5 Analysis of Procedure

Thus far the clustering procedure selected has been described and
flow charted and a brief explanation of the purpose of each block has
been given. Also certain earlier procedures have been briefly sketched.
We shall now analyze the effectiveness of the selected procedure in
terms of the objectives of Sec. 5.2.

5.5.1 Request Satisfaction

The procedure selected and most of the other procedures considered 
to date operate by making single changes to a set S which initially con- 
tains the Y set of the request. Documents not in S that are positively 
correlated to S are considered for addition to S and documents in S that 
are negative to S are considered for deletion from S. Let us first 
settle the question of whether it is possible in general for a procedure 
of this type to locate an answer cluster if one exists. 

Theorem. It is always possible to transform a set S which
initially contains only the Y set of the request into a (subset)
answer cluster if one exists by successively adding to S
documents that are positively correlated to S.

Proof. The proof of this theorem will be constructive.

(1) Initialize the set S with Y.

(2) If S coincides with the answer cluster A, the procedure
can terminate.

(3) Otherwise, consider the set of documents (A∩S̄) yet to
be added to S to form A. By the definition of a subset cluster in
Sec. 4.2, (A∩S̄) must be positively correlated to S and thus there is
at least one document in (A∩S̄) that is positively correlated to S. Add
this document to S and go back to Step (2). QED

Note that this theorem is true only for subset clusters. We can
show that it does not hold for local maximum clusters by the example of
Fig. 5.9. The set (y_1 y_2 x_1 x_2) forms a local maximum cluster, but it
cannot be reached from the set S=(y_1 y_2) by the addition of documents
positively correlated to S.

[Network diagram not reproduced. Links not shown are -5.]

Fig. 5.9. Local maximum cluster not accessible to procedure.

Even when K ≤ -C_max the theorem still does not hold for local maximum
clusters. In the network of Fig. 5.10 the set (y_1 y_2 x_1 x_2) again forms
a local maximum cluster, but it cannot be reached from the set S=(y_1 y_2)
by the addition of positively correlated documents.

[Network diagram not reproduced. Links shown are +4; links not shown
are -5.]

Fig. 5.10. Local maximum cluster not accessible to procedure.
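
The constructive proof and the counterexamples can be tried out on a small scale. The sketch below grows S toward a known answer set by adding, one at a time, documents that are positively correlated to S. It assumes the correlation of a document to a set is the sum of its links to the members. Both networks are hypothetical: the "ring" network is built in the spirit of Fig. 5.10 (links of +4 and -5), but it is not a transcription of that figure, whose diagram is not reproduced here.

```python
def corr_to_set(doc, members, links, default):
    """Summed links from doc to every other member of the set."""
    return sum(links.get(frozenset((doc, m)), default)
               for m in members if m != doc)


def grow_to(answer, start, links, default=-5.0):
    """Add positively correlated members of `answer` until S equals it.

    Returns the order of additions, or None if no remaining member of the
    answer is positively correlated to S (the answer is not reachable).
    """
    S, order = set(start), []
    while S != answer:
        positive = sorted(d for d in answer - S
                          if corr_to_set(d, S, links, default) > 0)
        if not positive:
            return None
        S.add(positive[0])
        order.append(positive[0])
    return order


ANSWER = {"y1", "y2", "x1", "x2"}

# Ring: each document is linked (+4) to two of the other three.
RING = {frozenset(p): 4.0 for p in
        [("y1", "y2"), ("y1", "x1"), ("x1", "x2"), ("x2", "y2")]}

# Complete: every pair is linked (+4).
COMPLETE = {frozenset((a, b)): 4.0
            for a in ANSWER for b in ANSWER if a < b}

print(grow_to(ANSWER, {"y1", "y2"}, RING))
print(grow_to(ANSWER, {"y1", "y2"}, COMPLETE))
```

In the ring network every document is positive to the other three taken together (4 + 4 - 5), yet neither x1 nor x2 is positive to (y1 y2) alone, so the set cannot be reached by single additions. In the complete network every partial set attracts the remaining documents and the construction succeeds.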

Actually it may be a distinct advantage if procedures of the type
being considered cannot reach certain local maximum clusters. It was
noted in Sec. 4.5 that a procedure which produces subset clusters only
would be preferred over one that results in local maximum clusters, but
that such a procedure had not been found. The above theorem and comments
show that procedures of the type selected can generate all of the
subset clusters which satisfy a given request. In
addition they may locate some (but not all) of the additional local
maximum clusters which satisfy the request.

Let us now observe that we have so far only proved that a suitable
clustering procedure of the type suggested may exist. The constructive
proof of the theorem does not indicate how to choose the correct document
to add to S in Step (3) if several documents are positive to S.
One could, of course, try all possibilities. Let us represent these
possible additions by a tree where each branch out of a node represents
the addition of a positively correlated document to S. In the example of
Fig. 5.11 there are three documents positively correlated to y_1, two
positively correlated to the set (y_1 x_1), etc.

[Tree diagram not reproduced in full. Recoverable nodes:]

S_0: (y_1)
S_1: (y_1 x_1)  (y_1 x_2)  (y_1 x_3)
S_2: (y_1 x_1 x_2)  (y_1 x_1 x_4)  (y_1 x_1 x_2)

Fig. 5.11. Possible additions to S.

A procedure which traversed all of the branches of such a tree
would be assured by the preceding theorem of finding an answer (subset)
cluster if one existed. However, one can quickly convince himself that
such an exhaustive examination of all possible positively correlated
additions is, in general, completely impractical because of the magnitude
of the task. What is needed is some way of determining which of
the positively correlated documents should be added to S on each
iteration.
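
The tree of possible additions can be enumerated mechanically for a very small network, which also makes the growth of the task visible. The sketch below assumes the correlation of a document to a set is the sum of its links to the members; the network is hypothetical and is not the one of Fig. 5.11.

```python
def corr_to_set(doc, members, links, default):
    """Summed links from doc to every other member of the set."""
    return sum(links.get(frozenset((doc, m)), default)
               for m in members if m != doc)


def explore(S, docs, links, default=-5.0):
    """Follow every branch of positive additions from S.

    Returns (terminal_sets, number_of_paths), where a terminal set is one
    to which no outside document is positively correlated.
    """
    S = frozenset(S)
    positive = [d for d in sorted(docs - S)
                if corr_to_set(d, S, links, default) > 0]
    if not positive:
        return {S}, 1
    terminals, paths = set(), 0
    for d in positive:
        found, count = explore(S | {d}, docs, links, default)
        terminals |= found
        paths += count
    return terminals, paths


# Hypothetical network: listed pairs are +4, every other pair is -5.
LINKS = {frozenset(p): 4.0 for p in
         [("y1", "x1"), ("y1", "x2"), ("y1", "x3"), ("x1", "x2")]}
DOCS = {"y1", "x1", "x2", "x3"}

terminals, paths = explore({"y1"}, DOCS, LINKS)
print(sorted(sorted(t) for t in terminals), paths)
```

Even this four-document network has three distinct paths leading to two different terminal sets. Since every node can branch once per positively correlated document, the number of paths can grow factorially with the number of candidates, which is why the exhaustive search is set aside.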

There will, of course, be cases where the answer cluster is
obtained no matter which of the positively correlated documents is added
to S on a given iteration. A simple example of a request and network
for which this is the case is given in Fig. 5.12. On the first iteration
one can add either x_1 or x_2 and still end up with the same answer
cluster.



[Network diagram not reproduced. Links shown are +4.]

Fig. 5.12. Network where it does not matter which document
is added to S first.



However, in the more general case the choice of which document to
add to S on each iteration is a very critical aspect of the clustering
procedure. The answer to a request may not even be found if the wrong
document is added to S on one or more of the iterations. As an example,
consider the network and request of Fig. 5.7. If the procedure were to
add x_1 to S on the first iteration, then (y_1 x_3 x_4), the only cluster
which satisfies the request, would not be found.

Let us now describe the criteria used by the procedure of Sec. 5.3
to decide which document to add to S on each iteration and note how
these criteria might help in obtaining an answer cluster if one exists.

In Steps 9-11 of Fig. 5.5 preference is given to documents that are
positively linked to each y_j (or else leave the y_j positive to S) and
negatively linked to each z_j (or else leave the z_j negative to S). The
network of Fig. 5.7 serves as an example of how this preference might
aid in obtaining the answer cluster. Documents x_3 and x_4 are considered
for addition to S before x_1 and x_2 and the answer cluster (y_1 x_3 x_4) is
obtained.

Steps 12 and 15 of Fig. 5.5 are for the purpose of minimizing the
bias on each iteration and will be discussed when we talk about request
modification and ambiguity.

In Step 14 the document which is selected for addition to S is the
one that has the highest positive correlation to S from among those
documents that have met all of the earlier criteria.

The theorem at the beginning of this section shows that the only
operation that a procedure needs to perform is the addition of positively
correlated documents to S if the appropriate document to be added on
each iteration can be determined. If, in fact, the procedure mistakenly
adds on a given iteration a document which is not part of the answer,
then it may still be possible to arrive at the answer if the procedure
is allowed to also delete documents that have become negatively
correlated to S (Steps 5-7 of Fig. 5.4). In the network of Fig. 5.13 the
answer S_4=(y_1 y_2 x_1 x_2) is obtained even though S_1=(y_1 y_2 x_3).

[Network diagram not reproduced. Links shown are +4; links not shown
are -5.]

Fig. 5.13. Network showing that the procedure must be
allowed to delete as well as add.
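
The value of deletions can be demonstrated with a stripped-down loop that only checks the two cluster conditions, with no bias and no Z set. It is a sketch under assumptions, not the procedure of Sec. 5.3: correlation to a set is taken as the sum of the links, ties are broken alphabetically, and the network is hypothetical (links of +4 and -5 as in Fig. 5.13, but not a transcription of that figure).

```python
def corr_to_set(doc, members, links, default):
    """Summed links from doc to every other member of the set."""
    return sum(links.get(frozenset((doc, m)), default)
               for m in members if m != doc)


def add_and_delete(S, Y, docs, links, default=-5.0, max_iter=100):
    """Delete negative non-Y members first, otherwise add the best outsider.

    Returns the list of successive S sets, starting with the initial one.
    """
    S = set(S)
    history = [frozenset(S)]
    for _ in range(max_iter):
        def score(d):
            return corr_to_set(d, S, links, default)
        negative = [d for d in sorted(S - Y) if score(d) < 0]
        if negative:
            S.remove(min(negative, key=score))   # Condition 1 violated
        else:
            positive = [d for d in sorted(docs - S) if score(d) > 0]
            if not positive:
                return history                   # both conditions hold
            S.add(max(positive, key=score))      # Condition 2 violated
        history.append(frozenset(S))
    raise RuntimeError("no convergence within max_iter iterations")


# Hypothetical network: listed pairs are +4, every other pair is -5.
LINKS = {frozenset(p): 4.0 for p in
         [("y1", "y2"), ("y1", "x1"), ("y2", "x1"), ("y1", "x2"),
          ("y2", "x2"), ("x1", "x2"), ("y1", "x3"), ("y2", "x3")]}
DOCS = {"y1", "y2", "x1", "x2", "x3"}

history = add_and_delete({"y1", "y2", "x3"}, {"y1", "y2"}, DOCS, LINKS)
for step, s in enumerate(history, start=1):
    print(step, sorted(s))
```

Starting from a set that already contains the unwanted document x3, the loop adds x1 and x2, after which x3 has become negative to the rest of S and is deleted. The `max_iter` guard is there because a loop of this simple kind is not guaranteed to converge on every network.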

Despite the above features which help in the choice of the document
to be added on each iteration, there are still cases where the
procedure of Sec. 5.3 does not find an answer cluster even when one
exists. Consider the request and network of Fig. 5.14. Documents x_1,
x_2, and x_3 are linked to the documents in sets Y and Z by exactly the
same values and are all candidates for addition to S on the first
iteration. If the first document to be added is either x_1 or x_2, then the
procedure finds the cluster (x_1 x_2 y_1 y_2) which is the only valid answer
cluster for the request. If, however, x_3 is added to S first, then the
procedure reaches a point where no bias can be chosen which will
simultaneously keep y_1 and y_2 positive to S and z_1 negative to S and the
request is judged inconsistent.

[Network diagram not reproduced. Links shown are +4 unless otherwise
indicated; links not shown are -5. Y=(y_1 y_2), Z=(z_1). Only valid
answer cluster = (y_1 y_2 x_1 x_2).]

Fig. 5.14. Network illustrating the difficulties involved
in knowing which document to add to S on a
given iteration.




The alternatives open to the procedure for the network of Fig. 5.14
are shown in the decision tree of Fig. 5.15. It should be pointed out
that all of the procedures discussed in this chapter decide which document
to add to S on each iteration on the basis of the relatedness of
the document being considered to the documents in the S, Y, and Z sets
only. The inter-relatedness of the documents not in S, Y, and Z is not
a factor in the selection. Indeed, from a practical standpoint, it cannot
be used as a factor in the decision, since it would necessitate
considering the consequences of adding subsets of documents instead of
single documents and for r documents under consideration there are as
many as 2^r subsets to consider.




[Tree diagram not reproduced. Recoverable structure: the tree starts at
S_0=(y_1 y_2); the branches that begin by adding x_1 or x_2 reach
(y_1 y_2 x_1 x_2) at S_2; the branches that begin by adding x_3 end at S_3
in sets marked Inconsistent.]

Fig. 5.15. Tree illustrating the possible additions to
S for the network and request of Fig. 5.14.

If the documents to be added to S are chosen on the basis of their
relatedness to the S, Y, and Z sets only, then there is no way of
determining whether to add x_1, x_2, or x_3 to S_0 in Fig. 5.14. If one cannot
tell beforehand whether to add x_1, x_2, or x_3, then perhaps a procedure
should be devised that would at some later point back up and try another
direction if S becomes inconsistent with the request. In other words,
if x_3 is added to S in Fig. 5.14, perhaps one could on the fourth
iteration remove a subset containing x_3 from S and add x_1 and x_2. Such a
step would require not only that the procedure be able to know which
subset to remove but also that it remember all of the previous S sets
so that it would not fall into a non-terminating cycle. This approach
is also rejected as not being practical.

The philosophy adopted for this research project is that, in those
cases where the procedure has difficulty in locating an answer, the
user should be coupled into the procedure to guide the process in the
right direction. This is the reason for the interaction points in the
procedure. The user can step in before the addition or deletion of any
document and override the decision of the procedure by changing the
request, if he decides the cluster is moving into the wrong area. In
the case of Fig. 5.14 the user could easily obtain the cluster
(y_1 y_2 x_1 x_2) by specifying any member of the set (x_3 x_4 x_5 x_6)
to be uninteresting.

5.52 Request Modification

If the request as initially specified by the user is inconsistent 
or ambiguous, then some additional interplay may be needed between the 
system and the user so that it can be appropriately modified. Let us 
make some general comments about the suitability of the clustering pro- 
cedure for interaction with a user and then deal specifically with the 
problem of what particular type of interaction is needed to resolve 
request inconsistency and ambiguity. 

If a clustering procedure is to be used in close coupling with the 
user, then the process should be divisible into small units of effort. 
Each unit of effort should produce some useful piece of information that 
can be presented to the user and the user should be able to make changes 
to the request between these units of effort. 

The natural unit of effort is, of course, the iteration. The 
information produced by the iteration is the document to be added to or 
deleted from S. The change in the request can be the response of the 
user to the document presented. An iterative clustering procedure, 
therefore, lends itself very well to close supervision by the user. 

There are four interaction points shown for the procedure of 
Sec. 5.3. The initial specification of the request is made at Step 1. 
In Step 6, which immediately precedes the deletion of a document from S 
(Step 7), the user is given a chance to examine the document to be
deleted and to modify his request if he wishes to. In Step 13 the user
is allowed to ask questions and change the request before the addition
of a document to S. In Step 23 the request is judged inconsistent and
the user is again allowed to obtain information from the system and 
modify the request. These four steps provide an interaction point before 
each change to S and on each iteration of the procedure. A description 
of the full range of questions that can be asked by the user at these 
interaction points will be given when the retrieval language is presented 
in Chapter VIII. 

Let us now consider the problem of determining whether a request is
inconsistent or ambiguous. One test for inconsistency has already been
given. The last theorem of Sec. 4.5 states that in order for two
negatively correlated documents to be in the same cluster they must be
positively linked to at least one common document (if K < -C_max). Let us
present three more theorems pertaining to whether two documents are
assured of being in a cluster together or not.

Theorem. Two documents x_1 and x_2 can be positively correlated
to exactly the same documents and negatively correlated to the
same documents and still not be in the same clusters.
Proof. Consider the example of Fig. 5.16. The documents x_1 and x_2
are both positively correlated to x_3 and x_4 and negatively correlated to
x_5. However, (x_1 x_3 x_4) forms a cluster which contains x_1 and excludes
x_2. The link between x_1 and x_2 is dotted to show that they can be
positively or negatively linked and the theorem would still be true. QED



[Figure 5.16: network diagram not reproduced.]

Fig. 5.16. Network with x_1 and x_2 not in the same cluster.

Theorem. A document x_1 can be positively correlated to every
document that a document x_2 is negatively correlated to (and vice
versa) and x_1 and x_2 can still be in a cluster together.
Proof. The networks in Fig. 5.17 offer a proof of this theorem.
The documents x_1 and x_2 are in the same cluster (x_1 x_2 x_3) and yet the
values of their links to x_3 and x_4 have the opposite signs. QED




[Figure 5.17: two alternative network diagrams not reproduced.]

Fig. 5.17. Network with x_1 and x_2 in the same cluster.

If one adds the restriction that K < -C_max, then the above theorem
is only true for positively correlated document pairs. The last theorem
of Sec. 4.5 states that when K < -C_max two negatively correlated
documents can occur in a cluster together only if they are positively
linked to one or more of the same documents.

Theorem. Two documents x_1 and x_2 are assured of always
being in the same clusters together if C(x_1 x_2) is greater than
the absolute magnitude of the difference in the correlations
of x_1 and x_2 to every possible subset of other documents.
Proof. To prove this theorem let us assume that x_1 and x_2 are not
in the same cluster and then show a contradiction. Let us say that x_1
forms a cluster with the set of documents A which does not include x_2, as
indicated in Fig. 5.18.

[Figure 5.18: network diagram not reproduced. It shows x_1 and the set A
enclosed as a cluster, with x_2 outside it.]

Fig. 5.18. Network for proof of theorem.

Since $x_1 \cup A$ is a cluster:

$$C(x_1 A) > 0$$

$$C[(x_2)(A \cup x_1)] \le 0$$

Rearranging and combining these inequalities:

$$C(x_2 A) + C(x_1 x_2) \le 0$$

$$C(x_1 x_2) \le -C(x_2 A)$$

$$C(x_1 x_2) < C(x_1 A) - C(x_2 A)$$

$$C(x_1 x_2) < |C(x_1 A) - C(x_2 A)|$$

This last inequality is in conflict with the part of the theorem
which states that for any A:

$$C(x_1 x_2) > |C(x_1 A) - C(x_2 A)|$$

QED

These three theorems give some indication of the difficulties
involved in determining if two documents are in the same cluster on the
basis of the links from those documents to the other documents of the
network. The third theorem here and the last theorem of Sec. 4.5 would
help in some cases to determine whether documents can co-occur in
clusters, but they have far from general applicability.

It was, therefore, concluded that there was no easy test which 
could be initially performed to determine if the request was inconsis- 
tent or ambiguous. The tests which were devised consisted of attempts 
to find one or more clusters which satisfied the request and required at 
least as much effort as the finding of an answer for a valid request. 
It was decided that the procedure should not concern itself with the 
problems of request ambiguity and consistency at first but should assume 
that the request is valid and start trying to find the answer cluster. 
If during this process it was decided that the request was inconsistent, 
then the user would be notified of this fact. And if the user was still 
worried about ambiguity after a cluster had been found, then he could 
perform some further searching to satisfy himself that he had retrieved 
what he was after. 

It was further decided that the user should be given the option of 
being able to interact with the procedure on any or all of the itera- 
tions in order to monitor what was being retrieved and in order to 
modify the request if the situation demanded it. Thus a user who sus- 
pected his request to be ambiguous or inconsistent could carefully watch 
what documents were being added to S to make sure that he was obtaining 
what he wanted, while the user who had confidence in the validity of his 
request could let the procedure run to completion unattended. 

The rule which was followed in the design of the procedure of 
Sec. 5.3 was, therefore, to allow the user to interact at any point he 
wished to (and especially in cases where an invalid request was 
suspected), but to never require that he respond before the clustering 
could continue. Thus in Steps 23 and 24 of Fig. 5.6 the request appears
to be inconsistent. The user is given the chance of changing his
request if he wishes. If no change is made, then the procedure picks a 
document to be deleted from Z so that clustering can continue. 

Also in the case of ambiguity the procedure is designed to find the 
most reasonable answer cluster it can for presentation and not to depend 
on the user to clear up the ambiguity. This is the purpose of Steps 12
and 15 in Fig. 5.5. If two clusters with different biases are both
valid answers to the request, then the one with the smaller bias is 
considered a better selection. Therefore, an attempt is made to make 
the bias as small as possible on each iteration. 

5.53 Convergence

A major objective in the design of the clustering procedure is to 
insure that it will always terminate in a finite number of steps for 
every possible document network and every possible request. A procedure 
which occasionally drops into an infinite loop would, of course, be 
completely unacceptable. The possibility of an infinite loop comes 
about because of the fact that the procedure can delete as well as add 
documents to the set S. If on some iterations the set S has the same 
composition as it had on a previous iteration, and if the procedure 
does not remember all of the previous S sets, then a non-terminating
cyclic behavior is possible. 

In Phase I of the procedure convergence is assured by the following 
theorem. 

Theorem. A procedure is convergent if the only types of
changes made to the set S being formed are the addition of
documents positively correlated to S and the deletion of
documents negatively correlated to S.
Proof. The internal correlation of S is increased by the addition
of a document positive to S. It is also increased by the deletion of a
document negative to S. Thus C(S) increases monotonically as these two
types of changes are made to S. This means that C(S) is larger on a
given iteration than for any earlier iteration. Therefore the
composition of S must be different on each iteration. Since there are at most
2^n possible S sets (for a network of n documents), there are at most 2^n
iterations of the procedure before it terminates. QED
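
The argument of the proof can be checked mechanically on a small
network. The following is a minimal sketch in Python and is not the
procedure of Sec. 5.3. It assumes that C(x S) is the sum of the links
from x to the other members of S and that C(S) is the sum of the links
over all pairs in S. The three-document network, the link values, and
all function names are hypothetical.

```python
from itertools import combinations


def link(links, a, b):
    """Symmetric link value between two documents (0 if unspecified)."""
    return links.get((a, b), links.get((b, a), 0))


def corr_to_set(links, x, s):
    """C(x S): sum of the links from x to the members of s other than x."""
    return sum(link(links, x, y) for y in s if y != x)


def internal_corr(links, s):
    """C(S): sum of the links over all pairs of documents inside s."""
    return sum(link(links, a, b) for a, b in combinations(sorted(s), 2))


def grow_cluster(links, docs, start):
    """Delete members negative to S, else add outsiders positive to S.

    Returns the final set and the value of C(S) after every change.
    """
    s = set(start)
    history = [internal_corr(links, s)]
    while True:
        negative = [x for x in sorted(s) if corr_to_set(links, x, s) < 0]
        positive = [x for x in sorted(docs - s) if corr_to_set(links, x, s) > 0]
        if negative:
            s.remove(negative[0])
        elif positive:
            s.add(positive[0])
        else:
            return s, history
        history.append(internal_corr(links, s))


# Hypothetical network: b is attractive at first but is deleted once c joins.
links = {("a", "b"): 1, ("a", "c"): 5, ("b", "c"): -3}
cluster, history = grow_cluster(links, {"a", "b", "c"}, {"a"})
print(sorted(cluster), history)
```

In this example one document is added and later deleted, yet the
recorded values of C(S) rise at every step, so no composition of S can
recur.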

If the bias of the network is changed as it is in Phase II, then 
the above theorem no longer insures convergence. For example, the 
following steps might possibly be taken by a hypothetical procedure in 
trying to obtain a cluster in the network of Fig. 5.19. 




[Figure 5.19: network diagram not reproduced. Links not shown are -6.]

Fig. 5.19. Network which may cause a procedure to cycle.

(1) S_0 = (y_1)

(2) S_1 = (y_1 x_1)             C(x_1 S_0) = 5

(3) S_2 = (y_1 x_1 x_2)         C(x_2 S_1) > 0

(4) Bias = -2 to keep z_1 negative

(5) S_3 = (y_1 x_1 x_2 x_3)     C(x_3 S_2) = 1

(6) Bias = -3 to keep z_1 negative

(7) S_4 = (y_1 x_1 x_2)         C(x_3 S_4) = -1

(8) Bias = -2 to just keep z_1 negative

At this point the procedure returns to Step (5) in a never ending 
loop. 






In order to avoid such cycles Phase II of the procedure selected 
(Sec. 5.3) synchronizes each change in bias with the addition of a 
document to S. If the document being added increases the internal 
correlation of S by k bits, then a decrease in bias is allowed which 
decreases the internal correlation by up to k bits. Thus the total 
internal correlation of S is still increased on each iteration and 
convergence is again assured. 

In the above example Phase II would combine (synchronize) Steps (3)
and (4) and allow the bias to still be -2 bits. Steps (5) and (6) would
also be combined but the bias would only be allowed to go to -2.2 bits
(a change of -C(x_3 S)/5). Step (7) would not be taken because x_3 would
not be negative [C(x_3 S) = 0.6].

Thus far we have talked about the effect of decreasing the bias 
on convergence. An increase in bias does not reduce the total internal 
correlation and would not necessarily have to be synchronized with 
additions to the set. For purposes of symmetry, however, bias increases 
are placed under the same restrictions that bias decreases are.

Finally, let us consider convergence in Phase III. Bias changes 
that are not synchronized with the addition of a document are now 
allowed, but the bias can change in only one direction. We have already 
shown that the clustering procedure is limited to a finite number of 
iterations for a given bias (by the above theorem). Phase III permits 
only a finite number of bias changes so the total number of iterations 
is finite and we are assured of convergence once more. 

5.54 Minimum Number of Iterations

Those steps which are taken to improve the proper selection of the 
document to be added on each iteration should also help to decrease the 
number of deletions necessary on later iterations. We have already
discussed the problem of choosing the correct document on a given
iteration. 



PART THREE: EXPERIMENTAL SYSTEM

In the last three chapters the basic components
of the theoretical model were presented. The next
three chapters describe the experimental system which
was developed so that the ideas and concepts of the
model could be tested in a realistic environment.

The four aspects of the experimental system
that will be covered are:

Chapter VI:   Computational Facilities and
              Data Base

Chapter VII:  File Structure
Chapter VIII: Interaction Language
CHAPTER VI
COMPUTATIONAL FACILITIES AND DATA BASE

There are two projects at M.I.T. on which this research endeavor is 
highly dependent. Project MAC supplied the computational facilities for 
the experimental phase of the project. The Technical Information Project 
supplied the document collection and data base on which the experiments 
were performed. In addition these two projects provided considerable 
other technical and general assistance. Since the computational 
facilities and data base are essential components of the experimental 
system, they will now be described. 



6.1 Computational Facilities 

The experimental portion of this project was designed for the 
Project MAC time-sharing system [21]. In this section we shall describe
the MAC system and note some of its features that are of particular 
significance to this project. A more complete description of the 
objectives and characteristics of the MAC system can be found in the 
references^ 2 ' 2 

Fig. 6.1 is an abbreviated diagram of the equipment included in
the MAC system. Some of the more significant parameters of this
equipment are given in Fig. 6.2. All of the equipment shown in Fig. 6.1 is
physically located at M.I.T.'s Technology Square with the exception of
the time-sharing consoles. Over 100 of these consoles are located at
various places on the M.I.T. campus and can be connected to the 7750
through the M.I.T. telephone exchange. There are also MAC consoles at
more remote locations. Indeed any TWX or TELEX telegraph station has
the capability of being connected into the MAC system. Each console
has a dual purpose. It communicates to the 7750 what characters have
been typed on its keyboard and it also types out messages originating
in the 7094 that are routed to it through the 7750.

In a time-shared computer a number of consoles can be simultaneously 
connected into the system and can independently obtain the services of 
the central processor. A limit is normally placed on the number of 
consoles that can be actively connected at any one time. The purpose of 
this limit is to help insure that those who are connected will be 
promptly serviced. The current limit for the MAC system is 30, but it 
varies periodically as changes and improvements are made in the system. 

One of the core storage banks (bank A) contains the time-sharing 
supervisory program. This program decides which of the users who 
currently want service has the highest priority. The program of the 
highest priority user is loaded into core (bank B) from the disc or 
drum and allowed to run for up to two or three seconds. Then the 
program is removed (swapped) and the new highest priority program is 
loaded and run. 

The IBM 1302 disc is used for permanent or temporary storage of 
programs and data. The data file to be described in the next section 
is stored on this disc as well as programs which arrange and structure
it and allow the user to communicate with it. 
[Figure 6.1: block diagram not reproduced. Blocks shown: tapes, drums,
printer, and other peripheral equipment; data channels; modified
IBM 7094 central processor; core storage banks; data channel; channel
and file control; IBM 1302 disc; IBM 7750 transmission control unit;
time-sharing consoles (IBM 1050's, Model 35 Teletypes, etc.).]

Fig. 6.1. Project MAC Equipment Configuration.
Basic word size                              36 bits

Core storage operating cycle                 2 microseconds
(to read or write 1 word)

Size of core storage banks A and B           32,768 words each

1302 disc storage capacity                   34.56 million words
(80,000 tracks of 432 words each)

1302 disc scan time                          50-180 milliseconds to
                                             position on track;
                                             50 milliseconds to read
                                             track.

Transmission rate to and from                about 100 bits/second.
time-sharing consoles

Physical limit on number of consoles         [illegible]
connected to 7750
(The actual limit is lower)

Fig. 6.2. Significant Parameters of MAC System.
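
The disc capacity quoted in Fig. 6.2 can be checked by simple
arithmetic. The following is a minimal sketch in Python, assuming the
figure's entries read 80,000 tracks of 432 words each and a 36-bit word.
The constant names are hypothetical.

```python
TRACKS = 80_000          # tracks on the 1302 disc, as read from Fig. 6.2
WORDS_PER_TRACK = 432    # words per track, as read from Fig. 6.2
WORD_BITS = 36           # basic word size in bits

capacity_words = TRACKS * WORDS_PER_TRACK
capacity_bits = capacity_words * WORD_BITS

print(capacity_words)    # total words on the disc
print(capacity_bits)     # the same capacity expressed in bits
```

The product agrees with the 34.56 million words given in the figure.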
6.2 Data Base

The basic data needed to implement the theoretical model of Part 
Two is a document collection and a file of partitionings of that 
collection. The document collection selected is described in the next 
section and the final section of the chapter contains a discussion of 
the type of partitioning data that will be used.

6.21 Document Collection 

The Technical Information Project at M.I.T. is currently
accumulating a file of information on articles found in the physics periodical
literature. This file covers about 26,000 articles from 25 different
journals. Fig. 6.3 lists the names of the journals and the extent of the
coverage in terms of volumes. The time period covered for each journal
is 1 Jan. 1963 to the present. Note that all of the articles in the
volumes listed are included.

One can gain some appreciation of the extent of the coverage of the
file by noting that the 25 journals account for over 50% of the articles
that are abstracted for Physics Abstracts.

The file is currently growing at the rate of 1500 articles a month.
Periodically new journals are added to the file. Journals to be included
are selected on the basis of a statistical analysis of their citations.
These selection criteria are described more fully elsewhere.

The information extracted for each article is the journal identifi- 
cation, volume and page number, title, author(s), author location(s), 
and coded bibliographic citations. Fig. 6.4 is an example of the
information available in a given article. Fig. 6.5 summarizes some of the
parameters of the file. 
                                                   Journal      Volume    Number of
    Journal                                        Code         Range     Articles

 1. Annals of Physics                              384          21-36        275
 2. Applied Physics Letters                        646          2-8          592
 3. Canadian Journal of Physics                    55           41-44        531
 4. Helvetica Physica Acta                         43           36-38        202
 5. Indian Journal of Physics                      164          37-39        165
 6. Japanese Journal of Applied Physics            612          2-4          328
 7. JETP Letters                                   821          1-2           65
 8. Journal of Applied Physics                     11           34-37       1643
 9. Journal of Chemical Physics                    12           38-44       3398
10. Journal of Mathematical Physics                227          6            193
11. Journal of the Physical Society of Japan       80           18-20        759
12. Nuovo Cimento                                  17           27-40       1385
13. Nuclear Physics                                682          46-75       1529
14. Physica                                        21           29-31        359
15. Physical Review                                1            129-142     3713
16. Physical Review (Series B)                     199          133-140     1791
17. Physical Review Letters                        41           10-16       1585
18. Physics Letters                                49           3-20        2880
19. Physics of Fluids                              799          6-8          607
20. Proceedings of the Physical Society (London)   [illegible]  81-87        738
21. Progress of Theoretical Physics (Kyoto)        29           29-34        392
22. Soviet Journal of Nuclear Physics              825          1            144
23. Soviet Physics - JETP                          669          16-21       1485
24. Soviet Physics - Solid State                   310          5-7          814
25. Soviet Physics - Technical Physics             790          6-10         898

    Total                                                       178       26,471

Fig. 6.3. Journals covered by the physics periodical file
          of the Technical Information Project (March 20, 1966).
Physical Review
Volume 136
Page: 0001

Spectral properties of a single-mode ruby laser. Evidence of
homogeneous broadening of the zero-phonon lines in solids
Tang, C. L.
Statz, H.
Demars, G. A.
Wilson, D. T.

Waltham, Massachusetts
Raytheon Research Division

J001 V102 P1252   J001 V112 P1940   J001 V128 P1726
J001 V133 P1029   J011 V034 P1682   J011 V034 P2289
J011 V034 P2935   J018 V187 P0493   J018 V195 P0587
J041 V006 P0106   J046 V009 P0399   J646 V002 P0222

Search completed, 257 articles.

1.99 seconds, 129.1 articles/sec.

Fig. 6.4. Example of the information available on a given
          article. The four lines of codes are the coded
          citations (J=journal, V=volume, P=page).
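
The coded citations lend themselves to mechanical decoding. The
following is a minimal sketch in Python, assuming that every citation
has the form shown in Fig. 6.4: the letter J and three digits, the
letter V and three digits, and the letter P and four digits. The names
Citation and parse_citations are hypothetical and are not taken from the
experimental system.

```python
import re
from typing import List, NamedTuple


class Citation(NamedTuple):
    journal: int   # journal code, as in the code column of Fig. 6.3
    volume: int
    page: int


# One coded citation: J + 3 digits, V + 3 digits, P + 4 digits (assumed widths).
_CODE = re.compile(r"J(\d{3})\s+V(\d{3})\s+P(\d{4})")


def parse_citations(line: str) -> List[Citation]:
    """Decode every coded citation found on one line of the file."""
    return [Citation(int(j), int(v), int(p)) for j, v, p in _CODE.findall(line)]


print(parse_citations("J001 V102 P1252   J001 V112 P1940   J001 V128 P1726"))
```

A citation whose digits fall outside this pattern is simply skipped by
the sketch. A production version would presumably report it for
clerical correction instead.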



Number of articles available on the disc           26,471

Time span covered                                  Jan. 1963 to present

Files key-punched but not currently on the disc:

    (1) Physical Review, Vol. 77-128 (1950-1962)

    (2) Journal of Chemical Physics, Vol. 28-37 (1958-1962)

Average number of articles per track               6.7
Average number of authors per article              2.02
Average number of citations per article            12.
Average number of words per title                  8.

Fig. 6.5. Parameters of T.I.P. data file (March 20, 1966).
Initially the information is key-punched on IBM cards. After some
preliminary editing and correction it is then loaded on the IBM 1302 disc
of the Project MAC computer. On the disc it undergoes more editing and
is transformed into the format selected for permanent storage (see
Sec. 7.1).

The T.I.P. file has certain features which make it attractive for
use by this research project. It is of sufficient size and interest to
attract serious users. The articles covered contain a substantial
number of citations, which will be shown to be of particular use shortly.
The generation of the data involves only clerical and mechanical
operations (i.e. no human indexing or evaluation is required).

6.22 Partitions 

Some of the advantages to having a retrieval system based on user 
feedback were discussed in Chapter II. A basic objective of this 
project was stated to be the investigation of the feasibility of such a 
system. In Chapter III a particular form that user feedback could take 
was described. Basically it consisted of each interaction of a user 
with the document collection resulting in a partitioning of the docu- 
ments into a set of interesting documents and a set of uninteresting 
documents. 

This type of interaction was described so that one could better 
understand the motivation behind the choice of the sample space, 
probabilities, and other aspects of the theoretical model. Actually the 
theoretical model as developed in Chapters III, IV, and V in no way 
requires that the partitionings on which the probability estimates are 
based be generated by user interactions. Any type of partitioning data 
could be used, even data that has been arbitrarily contrived. Indeed,
in the experimental system another type of partitioning was used because 
usage data is not readily available at the present time. 

Let us consider whether a change in the type of partitioning data 
employed by the experimental system will impair its effectiveness in 
testing whether a system based on usage data is feasible. First it can 
be observed that much of this investigation has very little, if any, 
dependence on the particular type of data being utilised. For example, 
the objective of a procedure of Chapter V is to find a cluster of 
documents. Its ability to do this could be examined and tested as well 
on the set of arbitrarily selected partitionings of a hypothetical
document collection as on a set of partitionings generated by the
interaction of a real user population with a real library.

There are some reasons, however, why it is advisable to use a set 
of partitionings for the experimental system that is not artificial and
which resembles usage data as closely as possible. For example, the
utility of the interaction points in the procedure is best tested by
real users. This, of course, requires a data base which produces 
results that a user would be interested in. Also the overall effective- 
ness of the system to produce useful results can be properly evaluated 
only in a realistic environment. 

With this objective in mind let us now consider what types of
partitionings are available for the document collection described in the
last section. There were five types of partitionings that were
evaluated for this project. They consist of dividing the set of
documents into two subsets based on whether or not the documents:

(1) were written by a given author.

(2) contain a certain word in their titles.

(3) cite a given article.

(4) were cited by a given article.

(5) occur in a given subject category.

Thus by criterion (1) there are as many partitions as there are authors
in the file, with each author dividing the document file into those
papers he wrote and those he didn't write.

A detailed analysis of each of the above types of partitionings was 
conducted on one volume (vol. 128) of the Physical Review . Certain 
tests were also conducted on much larger parts of the document collection. 
Let us summarize the results of these tests and evaluate each of the five 
partitioning criteria. 
(1) Author Partitions.

Difficulty was encountered in devising an algorithm that could 
determine if two author names referred to the same individual. A sur- 
prisingly large number of the authors were not consistent in the way 
they gave their names. Given names were sometimes supplied in full, 
sometimes represented by an initial, and sometimes left off altogether. 
The method which yielded the best results required an exact match of the 
surname and required that given names either match exactly or match on 
the first letter if one of the names was a single letter (i.e. an initial). 
We at first allowed a missing given name to be a match for anything, but 
this produced too many false matches. We, therefore, required that in 
order for a match to occur the number of given names had to coincide. 
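
The matching rule just described can be sketched as follows. This is a
minimal illustration in Python, assuming that each author has already
been split into a surname and a list of given names. The comparison
ignores letter case and trailing periods, which is an added assumption,
and the function names and the sample names are hypothetical.

```python
from typing import List, Tuple

Author = Tuple[str, List[str]]  # (surname, given names or initials)


def _norm(name: str) -> str:
    """Lower-case a name and drop surrounding blanks and a trailing period."""
    return name.strip().rstrip(".").casefold()


def given_names_match(g1: str, g2: str) -> bool:
    """Exact match, or first-letter match when either name is an initial."""
    g1, g2 = _norm(g1), _norm(g2)
    if g1 == g2:
        return True
    if len(g1) == 1 or len(g2) == 1:
        return g1[:1] == g2[:1]
    return False


def same_author(a: Author, b: Author) -> bool:
    """Apply the rule: same surname, same number of given names, each matching."""
    (surname_a, given_a), (surname_b, given_b) = a, b
    if _norm(surname_a) != _norm(surname_b):
        return False
    if len(given_a) != len(given_b):
        return False
    return all(given_names_match(x, y) for x, y in zip(given_a, given_b))


print(same_author(("Smith", ["John", "A."]), ("Smith", ["J.", "A."])))
```

Under this rule "Smith, J." and "Smith, J. A." are treated as different
authors, which reflects the requirement that the number of given names
coincide.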

Another difficulty was that roughly half of the authors were the
authors of only one paper. This produced a large number of partitionings
with only one document in the subset of "interest", with the consequence
that many of the papers did not co-occur with any other
paper by this method.

A third drawback to this type of partitioning arises in those cases
where an author changes his area of interest and publishes articles on 
unrelated subjects. 
(2) Word Partitions. 

If every title word is allowed to create a partition of the file,
then practically every document will co-occur with every other document
because of the common function words like "of", "the", etc. The
alternative is to try to identify function words and exclude them from use.
However, there is no clear distinction between function words and keywords.
It is fairly clear that certain words should be eliminated if co-occurrences
are to be meaningful. However, there is a large grey area of words such
as "effect", "wave", "theory", or "electronic" that in and of themselves
create little meaningful linkage, but in combination with other words
are very significant. The approach adopted for the tests was to
eliminate all words that occurred in over 5-10% of the titles. This
unfortunately eliminated the word "nuclear" while allowing words like
"between" and "theory" to create partitions.

A second problem in using word partitions is that there are a
number of words which differ from each other by only a suffix (e.g.
superconductor, superconductors, superconducting, superconductive,
superconductivity). A table was compiled of 40 of the more commonly
occurring suffixes of the title words in the document file. All of the
words which differed from each other by one of these suffixes were
considered equivalent in creating partitionings.
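
The two steps described for word partitions, the frequency cutoff and
the merging of words that differ only by a suffix, can be sketched
together. The following is a minimal illustration in Python and not the
program used for the tests. The short suffix list stands in for the
table of 40 suffixes, which is not reproduced here. Stripping the
longest listed suffix is one possible way of treating such words as
equivalent. The titles and all names are hypothetical.

```python
from collections import Counter

# Hypothetical stand-in for the table of 40 suffixes mentioned in the text.
SUFFIXES = ("ivity", "ors", "ing", "ive", "or", "s")


def stem(word, suffixes=SUFFIXES, min_stem=4):
    """Strip the longest listed suffix, keeping at least min_stem letters."""
    word = word.casefold()
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


def title_stems(title):
    """Set of stemmed words in one title, with simple punctuation removed."""
    words = (w.strip('.,;:()"') for w in title.split())
    return {stem(w) for w in words if w}


def word_partitions(titles, max_fraction=0.10):
    """Map each retained stem to the documents whose titles contain it.

    Stems occurring in more than max_fraction of the titles are dropped.
    """
    stems = {doc: title_stems(title) for doc, title in titles.items()}
    doc_freq = Counter(s for doc_stems in stems.values() for s in doc_stems)
    limit = max_fraction * len(titles)
    partitions = {}
    for doc, doc_stems in stems.items():
        for s in doc_stems:
            if doc_freq[s] <= limit:
                partitions.setdefault(s, set()).add(doc)
    return partitions


titles = {
    "d1": "Theory of the ruby laser",
    "d2": "Theory of superconductors",
    "d3": "The superconducting transition",
    "d4": "Properties of the laser",
}
# A cutoff of 50% is used only because this sample has four titles.
partitions = word_partitions(titles, max_fraction=0.5)
print(sorted(partitions))
```

In this small sample the words "of" and "the" exceed the cutoff and
create no partitions, while "superconductors" and "superconducting"
fall into the same partition.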
An even more basic problem involves the use of synonymous words for
the same concept. Some type of thesaurus would be necessary to link up 
articles with synonymous title words. It was decided that there are too 
many problems involved in the generation (or selection) and use of a 
thesaurus to warrant any effort in this direction in this research 
endeavor. 
(3) Cite-same Partitions. 

When two papers cite one or more of the same papers they are said to
be bibliographically coupled. A number of studies have been conducted
to analyze the characteristics of bibliographic coupling [28]. These
studies indicate that bibliographic coupling constitutes a very
meaningful and important type of relationship between papers, especially in
those document collections which have a sizable amount of citation
information. In the T.I.P. file of Sec. 6.21 there are an average of 12
citations per article and strict editorial policies make it easy to
identify the articles that are cited.
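
The construction of cite-same partitions from citation data can be
sketched as follows. This is a minimal illustration in Python, assuming
the citations of each article are available as a set of identifiers.
The article and reference identifiers and the function names are
hypothetical.

```python
from itertools import combinations


def cite_same_partitions(citations):
    """Map each cited article to the set of articles that cite it.

    Each such set is the subset of "interest" of one partition; every
    other article in the file falls in the complementary subset.
    """
    partitions = {}
    for article, cited in citations.items():
        for ref in cited:
            partitions.setdefault(ref, set()).add(article)
    return partitions


def coupling_counts(citations):
    """Count, for each pair of articles, the references they share."""
    counts = {}
    for citing in cite_same_partitions(citations).values():
        for a, b in combinations(sorted(citing), 2):
            counts[(a, b)] = counts.get((a, b), 0) + 1
    return counts


citations = {
    "A": {"r1", "r2"},
    "B": {"r2", "r3"},
    "C": {"r2", "r3", "r4"},
}
print(coupling_counts(citations))
```

Articles B and C share two references in this example and are therefore
coupled more strongly than either is to A.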
(4) Cited-by-same Partitions.

We note from Fig. 6.3 that the documents covered by the T.I.P. file
have all been written in the last three years. Due to the time required
to review and publish articles there is usually a period of at least six
months between the time an article is published and the time citations
to it begin to appear in the literature. And even after a span of two
to three years over half of the articles in the Physical Review have
still not been cited by subsequent articles in the Physical Review [29].
Thus this type of partitioning will have a very small yield for the
current T.I.P. file in terms of the number of documents that will occur
in one or more subsets of interest and in terms of the total number of
co-occurrences of articles that will be generated.
(5) Subject Category Partitions. 

A subject index is published of the articles in the Physical Review . 
Each article is assigned to from one to four categories. These category 
groupings form another type of file partitioning. However, not all of 
the 25 journals have subject indexes and there is no general agreement 
on category headings among the indexes that do exist. Also the categories 
even within a single journal are constantly changing. 

In the beginning we decided to use all five of the above types of 
partitionings for the experimental system with the hope that each would 
add meaningful links to the resulting document network. However, the 
results of the above tests led us to conclude that the use of criterion 
(3) only would result in an adequate set of partitionings, and would 
avoid some of the problems encountered in using the other criteria. The 
final experimental system is, therefore, based on partitionings of type 
(3) only. 
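
As an illustration of criterion (3), the following minimal sketch counts, for every pair of papers, how many cited papers the two have in common. Each cited paper is treated as defining one subset of the file, namely the papers that cite it. The paper identifiers and the `citations` mapping are invented for the example and are not taken from the T.I.P. file.

```python
from collections import defaultdict
from itertools import combinations

def cite_same_counts(citations):
    """Return {(a, b): number of papers cited by both a and b}, with a < b."""
    # Each cited paper defines one subset: the set of papers that cite it.
    citing = defaultdict(set)
    for paper, cited in citations.items():
        for c in set(cited):
            citing[c].add(paper)
    # Every pair of papers inside a subset co-occurs once in that subset.
    counts = defaultdict(int)
    for subset in citing.values():
        for a, b in combinations(sorted(subset), 2):
            counts[(a, b)] += 1
    return dict(counts)

# Hypothetical data: papers P1..P3 and the papers they cite.
citations = {"P1": ["A", "B", "C"], "P2": ["B", "C"], "P3": ["D"]}
couplings = cite_same_counts(citations)
```

Pairs that share no citation never appear in the result, which is why a sparse collection produces few links.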
CHAPTER VII
FILE STRUCTURE 

Thus far we have described the computational facility on which the 
experimental system operates and the data it uses. Let us now turn our 
attention to the problem of how the data should be arranged and structured 
for storage on the disc or in core. The first section of this chapter 
describes the general approach adopted in this project for the storage of 
data. Then four basic types of files are suggested and various combinations
of the basic types are proposed for the overall data storage
system of the project. Certain arguments favoring the overall storage 
system that was selected are set forth. In the last section a brief 
discussion is presented of the type of data structure that would be 
appropriate for the data that has been loaded into the high speed core 
storage for processing. 

7.1 Description and Arrangement of Data 

A few rather general comments on the problem of data storage are in 
order before we launch into a description of the particular types of 
files considered for this project. 

It will be useful in our discussion to think of the data to be stored
as forming a tree-like structure. For example, the information file
generated by the Technical Information Project (Sec. 6.21) can be sub-
divided into journals. Each of the journals can be broken down into a
number of volumes. Each volume in turn consists of some articles.
Within an article there are several information types: title, author(s),
etc. Some of these information types may be further subdivided. For
example, one can split the author information into the separate authors 
of the article. Fig. 7.1 portrays this tree structure. 

[Tree diagram. Levels, from the root down: data file, journal nodes,
volume nodes, article nodes, information types, separate authors.]

Fig. 7.1. Example of tree-like structure of data.

Each terminal node at the bottom of this tree represents a piece of 
data which must be stored, such as an author's name or a citation. Each
parent node represents the grouping together of one or more pieces of 
logically related data. For example, a volume node groups together all 
the articles which are contained in that volume. 

Let us first consider a couple of problems involved in storing the 
data represented by the terminal nodes. Much of this data is variable 
in length. For example, titles might vary from 20-200 characters. Two 
ways of handling variable size data suggest themselves. One might use a 
special code or flag to indicate the end of the piece of data or one 
might explicitly store the length somewhere in the file. The latter 
approach was selected since one would always have to perform a search to 
determine the end of the data if a flag were used. 

In addition to knowing how long a piece of data is we must know its 
type or identification. For example, it is not possible, in general, to 
determine whether a string of characters is a title or an author without
being explicitly told this fact. If there were one and only one title, 
author, citation, etc. for each article, then the information type could 
be specified by the relative position or order of the pieces of data. 
However, for a given article there may be none or several citations and 
one cannot specify the information type implicitly by the order. 

Thus, in addition to storing the actual data for each terminal node, 
one must give two additional facts: length and type. The storage of
these two additional facts is useful for the parent nodes in the above 
tree as well as for the terminal nodes. The type of information for a 
given node serves to identify that node from all of its sister nodes 
which are under the same parent node. The length information delimits 
the scope of the node. For example, a volume node would have for its 
identification the volume number, and for its length either the number of 
articles in the volume or the amount of storage occupied by those 
articles. Thus one can summarize the storage requirements of a data file 
by the following two statements. An identification and length must be 
stored for every node in the related tree structure. In addition one 
must store a piece of literal data for each terminal node. 
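
The identification-plus-length rule for terminal nodes can be sketched as follows. This is only an illustration under stated assumptions: the one-byte type codes, the two-byte length field, and the sample strings are all hypothetical, since the actual field widths and codes of the T.I.P. file are not given here.

```python
import struct

# Hypothetical one-byte type codes; the real codes are not given in the text.
TITLE, AUTHOR, CITATION = 1, 2, 3
HEADER = struct.Struct(">BH")  # identification (1 byte), length (2 bytes)

def pack_node(ident, data):
    """Store the identification and length ahead of the literal data."""
    return HEADER.pack(ident, len(data)) + data

def unpack_nodes(buf):
    """Walk consecutive nodes using the stored lengths, not end-of-data flags."""
    pos, nodes = 0, []
    while pos < len(buf):
        ident, length = HEADER.unpack_from(buf, pos)
        pos += HEADER.size
        nodes.append((ident, buf[pos:pos + length]))
        pos += length
    return nodes

record = pack_node(TITLE, b"Langmuir probes") + pack_node(AUTHOR, b"Smith, J.")
```

Because each length is stored explicitly, the reader can jump from one node to the next without searching for a terminating flag, which is the reason given above for preferring this approach.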

The last question to be discussed here relates to the actual 
physical order in which data is to be stored. Let us use the example of 
Fig. 7.2 to describe the arrangement selected. One can flatten the tree
of Fig. 7.2 out into the linear array of nodes shown in Fig. 7.3 such 
that no two connecting lines cross, and such that each parent node is to 
the left of its subnodes. 
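[Tree diagram: an article node with subnodes for title, authors, and
citations; the author and citation nodes have further subnodes.]

Fig. 7.2. Example used to show physical order given the data.

[Diagram: the nodes of Fig. 7.2 laid out in a single row in the order
title, authors, citations, each parent to the left of its subnodes.]

Fig. 7.3. Linear arrangement of data in Fig. 7.2.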



This is the physical order in which the data is stored for this
project. For the example of Fig. 7.3 the article identification and
length are first (node D). This is followed by the code for title
information, the title length, and the actual title (node T). Next is
the code for author information and the length of the author data
(node A). Then the information on a particular author is given (node A_1).
This includes the author's identification (his position among the
authors of the article), the length of his name, and his actual name.
The description for the remaining nodes is similar.

It may be of interest to note that the above approach is analogous
to Polish prefix notation. Consider the algebraic expression [A · (B+C)].
Its Polish prefix form, ·[A, +(B, C)], is obtained by flattening the tree
of Fig. 7.4 such that no lines cross. If one equates terminal nodes to
operands and parent nodes to operators, then our storage arrangement is
the Polish prefix form of the data.
[Tree diagram for the expression A · (B+C).]

Fig. 7.4. Polish prefix notation.
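
The flattening described in this section amounts to a preorder walk of the tree, in which each parent is written out before its subnodes. A minimal sketch follows; the tuple representation of a node and the labels for the individual author and citation nodes are assumptions made for the example.

```python
def flatten(node):
    """Preorder walk: each parent is emitted before (to the left of) its subnodes."""
    ident, children = node
    out = [ident]
    for child in children:
        out.extend(flatten(child))
    return out

# Hypothetical article D with a title, two authors, and two citations.
article = ("D", [("T", []),
                 ("A", [("A1", []), ("A2", [])]),
                 ("C", [("C1", []), ("C2", [])])])
order = flatten(article)

# The same walk applied to the tree for A * (B + C) yields its prefix form.
expr = ("*", [("A", []), ("+", [("B", []), ("C", [])])])
prefix = flatten(expr)
```

The same function serves both cases, which is the sense in which the storage order is the prefix form of the data.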

7.2 Types of Files 

In this section four basic types of data files are described. An 
overall data storage system might consist of only one of the file types 
or it might include a combination of several types. 

7.21 Raw Data File 

The file of data generated by the Technical Information Project 
(Sec. 6.21) will be termed the raw data file. It currently has the 
'Polish prefix' structure described above. The precise substructure of
a given article is shown in Fig. 7.5. The relative amount of storage
occupied by each of the types of information is given in the table of 
Fig. 7.6. 



[Tree diagram. Levels, from the root down: raw data file, journal nodes,
volume nodes, article nodes. Each article node has subnodes for title,
author(s), location(s), and citation(s).]

Fig. 7.5. Structure of raw data file.
Information type                      Percent of storage

article node (ident. and length)          5 %
title                                    21 %
authors                                  14 %
author locations                         28 %
citations                                32 %
                                        -----
                                        100 %

Fig. 7.6. Percent of storage occupied by each information type.

7.22 Inverted Files 

An inverted file is a type of index to the raw data file. For 
example, one might create an inverted author file by extracting from 
each article the authors' names. These names could be alphabetized and 
the duplicates deleted. Such a file would have the structure shown in 
Fig. 7.7. In this figure nodes D_1 ... D_n are the identifications of the
articles written by author A_1.

[Tree diagram. Levels, from the root down: inverted author file, author
nodes, articles.]

Fig. 7.7. Structure of inverted author file.

Inverted files have been created for title words, authors, 
locations, and citations. Because of a current lack of storage space, 
the inverted files cover only a part of the total raw data file. This 
partial coverage was found to be sufficient for experimental purposes, 
however. 
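
The construction of an inverted file can be sketched in a few lines. The dictionary layout of the raw data, the field names, and the article identifiers below are assumptions made for illustration; they do not reflect the actual T.I.P. record format.

```python
from collections import defaultdict

def invert(raw, field):
    """Map each value of `field` to the sorted ids of the articles containing it."""
    index = defaultdict(set)
    for article_id, info in raw.items():
        for value in info.get(field, []):
            index[value].add(article_id)  # the set drops duplicates
    # Alphabetize the keys, as described for the inverted author file.
    return {key: sorted(index[key]) for key in sorted(index)}

# Hypothetical raw data: article id -> information types.
raw = {
    "D1": {"authors": ["Jones", "Adams"], "title_words": ["langmuir", "probes"]},
    "D2": {"authors": ["Adams"], "title_words": ["plasma", "probes"]},
}
author_file = invert(raw, "authors")
word_file = invert(raw, "title_words")
```

The same routine applied to other fields gives the title word, location, and citation files.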




On the basis of the experience gained with these partially completed
inverted files, it is estimated that inverted files for the full raw data 
file will increase storage requirements by the percentages given in 
Fig. 7.8. 

Inverted file          Additional storage (percent of raw data file)

title word file          17.7 %
author file              15.3 %
location file            15.0 %
citation file            47.5 %
                         ------
Total                    95.5 %

Fig. 7.8. Storage requirements for inverted files.

There are certain additional steps that can be taken which will 
probably reduce the additional storage required to only about 10% of 
the raw data file. Thus adding inverted files increases storage requirements
by a factor of 1.5 to 2.0. It is suspected that the amount of
storage needed for file inversion is a relatively standard factor for 
most types of information. Certainly the types of information found in 
the test file of this project (title words, authors, locations,
citations) varied markedly in their characteristics but still followed 
roughly this factor of two increase. 

Fig. 7.9 shows that the relative amount of storage required for an 
inverted author file decreases as the size of the file increases. The 
leveling off shown leads one to believe that an order of magnitude 
increase in the test file would not significantly change the percent 
increase in storage required for an inverted author file. A similar 
leveling off was found for title words. 
[Graph: inverted author file size, as a percent of raw data file size,
plotted against the number of years in the stack. The curve falls and
then levels off.]

Fig. 7.9. Storage required for inverted author file.
(For articles in Physical Review, 1959-64.)



There is a good theoretical reason why the inverted files should 
require about the same amount of storage as the raw data itself. The 
reason is that the inverted files store the same information as the raw 
data file (except perhaps for the relative order of some of the data). 
Indeed one could reconstruct the raw data file from the inverted files 
by merely collecting together the title words, authors, etc. for each 
article. The one exception to the equivalence of the information found 
in the two types of files concerns order. One cannot determine from the 
inverted word file the order that the words originally had in the titles 
of the raw data file, but only which words belong to each title. Of 
course, some additional provision might be made so that inverted files 
contained order information as well as the article identifications. 
However the point here is that the two types of files should require 
about the same amount of storage. 
7.23 Linkage Files

A linkage file contains a description of a document network of the 
type described in Chapter III. The basic information needed to describe 
such a network consists of document node identifications and link values. 

The structure of a linkage file is shown in Fig. 7.10. For each 
document node in the network there is an entry in the file which consists
of the identification of the document along with the information on the 
links emanating from the node. The linkage information consists of the 
identifications of the other document nodes connected to the node in 
question along with the values of the connecting links. In such a file 
it is necessary to store only those links for which N_ij > 0, with the
understanding that the value of all other links is K.

[Tree diagram. Levels, from the root down: linkage file, document nodes,
linkage node pairs. Each linkage node pair holds the identification of a
linked document and the value of the link.]

Fig. 7.10. Structure of linkage file.

Note that the information on each link is specified in two places 
in a linkage file. For example, the value of C(x_i, x_j) is stored in the
entry for document x_i and also in the entry for x_j. This redundancy
makes it so that once the entry on a given document is located, one 
immediately knows all of the documents to which it is linked as well 
as the values of the links. 
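
A minimal sketch of such a linkage file is given below. Only links with a nonzero co-occurrence count are stored, each under both of the documents it joins. The numeric value chosen for K and the `link_value` function are placeholders: the real link measure C is defined elsewhere in the thesis and is not reproduced here.

```python
from collections import defaultdict

K = 0.0  # placeholder for the default link value K; its real value is not given here

def build_linkage(pair_counts, link_value):
    """Store each link with N_ij > 0 under both documents that it joins."""
    linkage = defaultdict(dict)
    for (i, j), n in pair_counts.items():
        if n > 0:
            value = link_value(n)
            linkage[i][j] = value
            linkage[j][i] = value  # deliberate redundancy
    return dict(linkage)

def lookup(linkage, i, j):
    """Links that were not stored are understood to have the value K."""
    return linkage.get(i, {}).get(j, K)

# Hypothetical co-occurrence counts; `float` is only a stand-in for C.
linkage = build_linkage({("x1", "x2"): 2, ("x1", "x3"): 0}, link_value=float)
```

With this layout a single lookup of the entry for one document returns all of its neighbors, at the price of storing every link twice.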
In an attempt to gain some insight into the size and characteristics
of linkage files, a test was conducted on one volume (Vol. 128) of the
Physical Review. Linkage files were created based on each of the five
types of partitions discussed in Sec. 6.22. The results of this test 
are summarized in Fig. 7.11. 

Partitioning criterion on         File size               Percent of total
which links are based             (based on size of       possible links
                                  Phys. Rev. Vol. 128)    for which N_ij ≠ 0

(1) Authors (estimated)
(2) Title words (for words
    occurring less than
    20 times)
(3) Cite-same
(4) Cited-by-same (citations
    to v. 128 from v. 128-133)
(5) Subject category              175 %                   15 %

Fig. 7.11. Table of linkage file sizes for vol. 128 of
the Physical Review.



Fig. 7.11 indicates that partitioning criterion (3) generates a
network in which about 1 1/2 % of the links have values other than K
(i.e., N_ij ≠ 0). This is for a single volume of the Physical Review. It
would seem reasonable that this percentage would be somewhat less for
the total document file. We shall assume in the analysis of the next
section that approximately 1% of the possible links in the network of
the total file have non-K values. This means that each document in the
T.I.P. file is linked to about (0.01)(26,000) = 260 other documents on the
average.
7.24 Request-Answer File

The actual generation of this type of file was never seriously
contemplated because of the immense amount of processing time and storage
space that would be required. It is described here because it represents
an extreme case to which we wish to make reference in the next section. 

A request-answer file contains the answer cluster for each possible
request. Its possible structure could be represented by Fig. 7.12.
D_1 ... D_k in this figure are the documents contained in the particular
answer cluster in question.

[Tree diagram. Levels, from the root down: request-answer file, possible
request nodes, answer cluster nodes, document nodes.]

Fig. 7.12. Structure of request-answer file.

Retrieval from this type of file would consist of a simple table 
look-up for the request and then presentation of the associated answer 
cluster. 

7.3 Storage Systems 

The overall storage system selected for this project could consist 
of any combination of one or more of the types of files described in the 
preceding section. For purposes of discussion and comparison let us 
suggest four types of storage systems. The first three were implemented 
and tested to some extent. System (2) is the one that was finally 
selected for this project. 







(1) Raw data file only. 

(2) Raw data file and inverted files. 

(3) Raw data file and linkage file. 

(4) Raw data file and request-answer file.

The raw data file is included in each of the four storage systems 
so that information on specific articles can be presented to the user at 
any time he wants it. For instance, a user might want to know the title 
and author(s) of an article that is about to be added to the set S. 
This information would be obtained from the raw data file. 

Each of the four suggested data storage systems could serve as a
base for the clustering procedure of Chapter V. There are some significant
differences in the characteristics of the retrieval system that
would result, however. Let us indicate some of the differences by dis- 
cussing four important characteristics of the resulting retrieval systems. 

7.31 Storage Space Required 

Since the raw data file is basic to all four systems, we will 
express storage requirements in terms of the size of that file. It has 
already been noted that the inverted files require about as much storage 
as the raw data file. If we make the assumption that 1% of all possible 
links have non-K values as was suggested in Sec. 7.23, then the linkage
file for the TIP document collection would be about six times as large 
as the raw data file. If we assume that every request for information 
consists of only two documents of interest and every answer cluster 
contains 20 documents, then a request-answer file would be about 35 
times the size of the raw data file. Much more space would be required 
if larger requests were allowed. These figures are summarized in 
Fig. 7.13. 
Storage system                        Storage (percent of raw data file)

(1) Raw data only                        100 %
(2) Raw data plus inverted               200 %
(3) Raw data plus linkage                700 %
(4) Raw data plus request-answer        3500 %

Fig. 7.13. Comparison of storage requirements for the four
types of data systems.

7.32 Processing Time 

Let us next determine the average amount of processing time that 
would be needed to transform a request into an answer cluster for each of 
the proposed storage systems. By processing time we mean the amount of 
time allocated by the central processor of the Project MAC system to 
running the clustering program. The time spent in swapping the program 
in and out of core storage is excluded. The ratio of the real time that
the MAC user must wait to the processing time varies with the number and 
type of users on the system and can range from one to forty or fifty. 

The time required to access a piece of data on the 1302 disc is 
about 1/2 second. This includes both the time spent by the disc control
supervisor and by the disc in locating and reading a track. Thus the 
request-answer system would require about a second in order to find an 
answer, since very little computational or manipulative work is required. 

For a linkage file system at least 20 accesses to the disc would be 
required (for a cluster of 20 documents). This would involve about 10 
seconds of processing time in addition to some computational time which 
was found to be small in comparison. We pick 15 seconds as the average
amount of time required to find a 20-document cluster if linkage files 
are available. 
The amount of processing time required to find a 20-document
cluster with an inverted file storage structure has been found to be 30-60
seconds. This includes 60 or so accesses to the disc and a fair amount 
of manipulation and computation. 

If only the raw data file is available, then one must pass through 
the total data file two or three times looking for documents that are 
linked to the documents in sets Y, Z, and S. One complete pass through 
the raw data file takes 200-300 seconds. Thus the average processing 
time would be on the order of 600 seconds. Fig. 7.14 summarizes the
processing time required for each of the four systems. 

Storage system                        Average processing time

(1) Raw data only                        600 sec.
(2) Raw data plus inverted                60 sec.
(3) Raw data plus linkage                 15 sec.
(4) Raw data plus request-answer           1 sec.

Fig. 7.14. Average processing time required to find a
cluster of 20 documents for the four types
of storage systems.
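
The disc-access part of these estimates can be checked with a few lines of arithmetic. The sketch below uses only figures quoted in this section (1/2 second per access, 20 and 60 accesses, two or three passes of 200-300 seconds); the function and variable names are invented for the example, and computation time is deliberately left out.

```python
ACCESS_SEC = 0.5  # disc access time quoted in the text

def disc_time(accesses):
    """Time spent on disc accesses alone, ignoring computation."""
    return accesses * ACCESS_SEC

linkage_disc = disc_time(20)    # one access per document in a 20-document cluster
inverted_disc = disc_time(60)   # "60 or so" accesses
# Raw data only: two or three passes at 200-300 seconds per pass.
raw_low, raw_high = 2 * 200, 3 * 300
```

The 15-second and 60-second figures adopted above therefore leave room for computation on top of the disc time, and the 600-second figure lies inside the range implied by the pass counts.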

7.33 Updating and Editing 

Besides the processing time involved in answering requests there is 
a certain amount of time required for updating and editing the file, 
since it is constantly changing. For purposes of comparison let us 
consider the problem of adding 335 articles (50 tracks of raw data) to
an existing file of 20,000 articles (3000 tracks). The time required to 
load and structure the raw data file will not be considered since it is 
common to all four storage systems. 

In order to update the inverted files one must extract the 
appropriate fields from the new raw data, sort them into the desired 
sequences and merge the sorted data with the old inverted files. The
current programs for doing this would take about 400 seconds for the 50
tracks of data. The time needed for each information type is as follows:
words - 90 sec, authors - 50 sec, citations - 210 sec, locations -
50 sec. The time for each process is as follows: extraction - 25 sec,
sorting - 150 sec, merging - 230 sec.
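
The three steps of extraction, sorting, and merging can be sketched as follows. The representation of an inverted file as a sorted list of (value, article id) pairs and the sample data are assumptions made for the example; the actual programs are not described in enough detail here to reproduce them.

```python
import heapq

def extract(new_articles, field):
    """Pull (value, article id) pairs for one information type."""
    return [(v, aid) for aid, info in new_articles.items()
            for v in info.get(field, [])]

def update_inverted(old_entries, new_articles, field):
    """old_entries is an already sorted list of (value, article id) pairs."""
    new_entries = sorted(extract(new_articles, field))   # sorting step
    return list(heapq.merge(old_entries, new_entries))   # single merge pass

# Hypothetical existing author file and one new article.
old = [("Adams", "D1"), ("Jones", "D1")]
new = {"D2": {"authors": ["Baker", "Adams"]}}
merged = update_inverted(old, new, "authors")
```

Only the new entries are sorted; the old file is read once during the merge, which keeps the update cost roughly proportional to the file size rather than requiring a full re-sort.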

Consider the problem of updating a linkage file with the links based 
on whether or not two papers cite the same paper (partition type (3) in 
Sec. 6.22). Updating can be accomplished by the following steps. First,
extract the citations from the 50 tracks of new articles. Sort these 
citations and compare them with the total raw data file to determine 
which articles are linked to each new article. During this comparison 
process generate a file of information on the new links. Sort this file 
and merge it into the old linkage file. The programs which were written 
to perform this updating process were only tested on small files of 
several hundred articles. Let us extrapolate the results and estimate 
how long it would take to update the linkage file for the case under 
consideration. Extracting and sorting the citations of the 335 new 
articles would take about 100 seconds. Matching the citations with the 
total raw data file would take about 1800 seconds and merging them into 
the old linkage file would require about 1200 seconds for a total of
4000 seconds.

The amount of time required to update a request-answer file would 
be more of a guess than an estimate. It would take at least 7000 
seconds to rewrite the file and probably 10 to 100 times more to find 
all the clusters. These figures are tabulated in Fig. 7.15 for ease in
comparison. 
Storage system                        Updating time

(1) Raw data only                          0 sec.
(2) Raw data plus inverted               400 sec.
(3) Raw data plus linkage               4000 sec.
(4) Raw data plus request-answer        7000+ sec.

Fig. 7.15. Processing time required to update a file of 20,000
articles with 335 new articles for each of the
four storage systems.



7.34 Flexibility and Compatibility

So far we have been mainly concerned with how much storage space 
and processing time is required for a system which finds answer 
clusters. Actually the process of finding clusters as proposed in this 
thesis is not considered to be the only retrieval tool which will be 
made available to the user. Rather clustering is looked upon as one 
possible component in a larger, more general retrieval system. It 
follows that the storage structure of the data should not be designed 
with just the clustering process in mind, but it should be chosen on the
basis of its utility and adaptability to a large class of retrieval 
functions. 

Even if the data file for the experimental system were to be used 
exclusively for clustering, it would still be useful to make the 
structure selected as general as possible. One reason why this is so 
stems from the fact that any experimental system is generally in a 
constant state of flux and any rigid or specialized data structure may 
soon be rendered obsolete. 

Let us suggest that the following objective might yield a data 
storage structure which would provide an adequate base for a large 
number of different retrieval functions and at the same time strike a 
suitable compromise between storage and time requirements.

"The amount of storage required should be minimized 
subject to the restriction that at no time should one have to 
serially search through the total file to obtain a given 
piece of information. By serial search we mean a sequential 
examination of every article in the file." 

7.4 Selection of Storage System

From Secs. 7.31 and 7.32 it is evident that no data structure will
at the same time minimize the processing time and storage space re- 
quired. Some type of engineering compromise is needed. This compromise 
must be influenced by such factors as the characteristics of the compu- 
tational facilities to be used and by the type of retrieval service that 
is to be offered. One must also consider the costs involved in updating 
the file and how often updating is to be performed. The decision is 
further complicated by the fact that the structure selected should be 
compatible with other retrieval functions and flexible to change. 

A storage system consisting of the raw data only requires the least 
amount of storage space and the least effort to update. Its major draw- 
back is in the time required to answer a request. Even now with the 
current file of about 26,000 articles the time required to find information
is generally too great to allow for close man-machine coupling.
And if the file size were to increase by an order of magnitude, a system 
based on this structure would certainly be too slow. 

The linkage and request-answer files have excellent response times 
but require an excessively large amount of storage space and are very 
hard to update. In addition they are designed specifically for the 
purpose of finding clusters and have little or no real value to other
retrieval operations. 

The second type of data storage system consisting of the raw data 
file and the inverted files was the one selected for this project. Its 
storage requirements were less than double that required for the raw 
data file alone. The processing time required to find a cluster was 
high, but not so high as to exclude close man-machine interaction, and 
it appears that an order of magnitude increase in the file size would 
not appreciably increase these time requirements. Updating of the 
system could be done on a daily or weekly basis without consuming an 
excessive amount of computational effort. The structure is also useful 
in a large number of other retrieval operations as will become more 
obvious in the next chapter. 

7.5 High Speed Storage Structure 

So far in this chapter we have discussed how the data should be
structured for permanent storage on the disc. A related problem con- 
cerns the form the data should take once it has been selected for 
processing and is loaded into high speed core storage. 

The approach that was used in the earlier versions of the experi- 
mental system was to convert the data to a "list" structure as it was 
loaded into core. This involves associating one or more address 
pointers with each piece of data. The pointers preserve the original 
sequence of the data without requiring that it occupy contiguous loca- 
tions in memory. One of the major advantages of such a structure is the 
relative ease with which the data can be re-arranged and with which 
particular pieces of data can be added and deleted. Some of the 
programming languages that have been developed to facilitate the creation
and manipulation of list structures are COMIT, LISP, SLIP, and SNOBOL.

It was later decided that the added flexibility obtained through 
the use of list structures was not, in general, needed for library-type 
data that remains relatively fixed. Indeed the processing time required 
to reformat the data into lists was considerable. Therefore the approach 
that was finally adopted was to leave the data in core in the same form 
that it was on the disc. 

It is actually easier to perform some of the operations needed in 
the formation of a cluster on this disc structure than it is to do them 
on the equivalent list structure. Take, for example, the calculation of 
the N_ij's. For the partitioning criterion selected this would involve
the comparison of two tables of citations. The most efficient way that 
has been found to do this is to have the citation codes of each article 
in numeric order on the disc, and to make a single synchronous pass 
through the two tables tallying the number of matching entries. The 
time required to do this match if the data has a list structure would 
probably at least double. There are also certain other operations (e.g. 
binary or logarithmic searches) for which a list structure is not well 
suited.
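
The single synchronous pass through two ordered citation tables can be sketched as below. The numeric citation codes are invented for the example, and the function name is hypothetical; the sketch assumes each table is in ascending order with no repeated codes.

```python
def count_matches(a, b):
    """Count codes common to two ascending tables in one synchronous pass."""
    i = j = n = 0
    while i < len(a) and j < len(b):
        if a[i] < b[j]:
            i += 1          # advance the table holding the smaller code
        elif a[i] > b[j]:
            j += 1
        else:
            n += 1          # matching entry: tally it and advance both
            i += 1
            j += 1
    return n

# Hypothetical citation codes for two articles, already in numeric order.
n_ij = count_matches([3, 8, 15, 42, 77], [8, 9, 42, 80])
```

Each table is read once from front to back, so the work grows with the sum of the two table lengths. On a linked list the same pass would require following a pointer at every step, which is consistent with the slowdown suggested above.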

For the final version of the experimental system a rather simple 
storage allocation system was adopted which kept track of the available 
free core storage. Through this system blocks of storage could be 
allocated, changed in size, or freed up for other uses. Reference to 
each block was through a numeric code so that the actual address of the 
block could change. This made it so that all the free storage could be 
kept in one contiguous block. Data from the disc was loaded into these 
blocks of storage and processed there.
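
A storage allocator of this general kind can be sketched as follows. This is a simplified illustration, not the allocator actually written for the project: the class and method names are invented, resizing of blocks is omitted, and a Python bytearray stands in for core storage. The point it shows is that, because blocks are referenced by numeric code, they can be moved on release so that the free storage remains one contiguous region.

```python
class Allocator:
    """Blocks are reached through numeric codes, so their addresses may move."""

    def __init__(self, size):
        self.core = bytearray(size)
        self.blocks = {}      # code -> (address, length)
        self.top = 0          # free storage is the single region core[top:]
        self.next_code = 0

    def allocate(self, length):
        if self.top + length > len(self.core):
            raise MemoryError("not enough free core")
        code = self.next_code
        self.next_code += 1
        self.blocks[code] = (self.top, length)
        self.top += length
        return code

    def free(self, code):
        addr, length = self.blocks.pop(code)
        # Slide every later block down so the free storage stays contiguous.
        self.core[addr:self.top - length] = self.core[addr + length:self.top]
        for c, (a, n) in list(self.blocks.items()):
            if a > addr:
                self.blocks[c] = (a - length, n)
        self.top -= length

    def write(self, code, data):
        addr, length = self.blocks[code]
        if len(data) > length:
            raise ValueError("data does not fit in block")
        self.core[addr:addr + len(data)] = data

    def read(self, code):
        addr, length = self.blocks[code]
        return bytes(self.core[addr:addr + length])

mem = Allocator(16)
a = mem.allocate(4)
b = mem.allocate(4)
mem.write(a, b"AAAA")
mem.write(b, b"BBBB")
mem.free(a)   # block b moves to address 0, but its code is unchanged
```

After the release, the caller still reads block `b` through the same code even though its address has changed.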

The S, Y, and Z document sets were also placed in blocks obtained 
from the storage allocator. It was later decided that this was a 
distinct disadvantage to the system because the sets were constantly 
changing and should have had the flexibility available from a list 
structure. 
CHAPTER VIII
INTERACTION LANGUAGE 

The description of the experimental system is now almost complete. 
The clustering procedure which is used in answering requests has been 
defined in Chapter V. The computational facilities and data base on 
which the system operates have been described in Chapter VI. In Chapter 
VII the way the data is structured was explained. 

The one aspect of the experimental system that has not been covered 
concerns the interface between the user and the system. In this chapter 
we will describe the language which permits the user to communicate and 
interact with the system. 

8.1 Background to Language 

As a way of introducing the language we will present in this 
section some of the general design objectives that were selected for the 
language and an example of a typical interaction using the language. 

8.11 Design Objectives of Language 

The first retrieval language developed for this project was 
designed specifically for clustering and bore little resemblance to the 
language used by the Technical Information Project programs in performing 
the more conventional matching functions (author, citation, and keyword 
searches, bibliographic coupling, etc.). It was found to be inconvenient 
and confusing to have to shift from one program and one language to 
another program and another language every time one wanted to shift from
a clustering request to a T.I.P. request and vice versa. It was decided
that the same general language should be used for both functions. This 
goal is related to the idea expressed in the last chapter that the 
clustering function should be considered a component of a larger retrieval
system (Sec. 7.34). Not only should the data structure be
designed for the larger, more general system, but the retrieval language 
should also. In the remainder of the chapter the clustering and matching 
functions will, therefore, be treated equally. 

In addition to having adequate expressiveness for the current 
clustering and T.I.P. commands, it was considered desirable that the
language be flexible enough so that it might be easily extended to other 
types of retrieval operations. 

A second objective of the language is that it should be easy to 
learn, use, and remember. It was decided that if the vocabulary and 
syntax of the language resembled normal English it would be easiest to 
learn and remember. However, it was found to be rather tedious after a 
while to have to type a complete English sentence for each request. An 
abbreviated version of the language was, therefore, developed for the 
experienced user which allowed much of the vocabulary to be abbreviated. 
The abbreviated version was such that one could make a smooth transition 
from the full English request to the abbreviated request as he became 
more familiar with the system. An example of a complete request and the 
equivalent abbreviated request follow. 

"Print the authors and locations of all the articles cited by the 
article, Physical Review, volume 135, page 3." 

"p aut loc of art cited by 1 135 1." 

A third goal of the language is that it be simple enough to process 
efficiently and quickly. Even a rather complex request in the language 
that was adopted takes much less than a second of central processor 
time to interpret. 

8.12 Example of Language 

In Fig. 8.1 is an example of an interaction that might occur 
between a user and the system. The lines that the user types are under- 
lined. First he initiates the MARS (Machine Aided Retrieval System) 
program. We assume that the one fact the user knows is that he is 
interested in something about Langmuir probes. He could just as well 
have known an author or paper that interested him or perhaps a combina- 
tion of these. 

In the first command he asks for a list of those articles containing 
the word, "Langmuir", in their titles. Let us say that after examination 
of the list produced, the user decides that the papers by three of the 
authors are the most interesting. He now asks for all papers written by 
these three authors (that have not already been retrieved). 

Next we assume that the user selects two of the papers as of 
particular interest and wishes to form a cluster around them. Further 
he decides that one of the papers is definitely not what he wants and 
he, therefore, specifies that it is not of interest. A close interaction 
sequence follows with the system presenting papers that are about to be 
added to or deleted from the set S and the user deciding which are of 
interest and which are not. 

Finally a cluster is formed and the user stores it on the disc for 
future reference. He then analyzes its characteristics by making various 
lists of frequency counts. 

RESUME MARS 
W 1348.4 

PRINT THE TITLES AND AUTHORS OF ARTICLES CONTAINING THE WORD, 'LANGMUIR'. 
17 ARTICLES IN SET 1. 

PHYSICA 
VOLUME: 30 
PAGE: 182 

STUDIES OF THE DYNAMIC PROPERTIES OF LANGMUIR PROBES I: MEASURING METHODS. 
CARLSON R. W. 
OKUDA T. 
OSKAM H. J. 

NUOVO CIMENTO 
VOLUME: 29 
PAGE: 487 

EFFECT OF A R.F. SIGNAL ON THE CHARACTERISTIC OF A LANGMUIR PROBE. 
BOSCHI A. 
MAGISTRELLI F. 

PRINT THE TITLES AND AUTHORS OF ARTICLES BY R. W. CARLSON OR T. OKUDA OR 
H. J. OSKAM BUT NOT IN SET 1. 

6 ARTICLES IN SET 2. 

JOURNAL OF THE PHYSICAL SOCIETY OF JAPAN 
VOLUME: 13 
PAGE: 1212 

DISTURBANCE PHENOMENA IN PROBE MEASUREMENT OF IONIZED GASES. 
OKUDA T. 
YAMAMOTO K. 

PRINT FOR DECISION THE TITLES AND AUTHORS OF ARTICLES RELATED TO PHYSICA, 
V. 30, P. 182, AND J. PHYSICAL SOCIETY OF JAPAN, V. 13, P. 1212, BUT NOT 
NUOVO CIMENTO, V. 29, P. 487. 

TO BE ADDED: 

PHYSICS LETTERS 
VOLUME: 11 
PAGE: 126 

THE PLASMA RESONANCE PROBE IN A MAGNETIC FIELD. 
CRAWFORD F. W. 
HARP R. S. 

IS THIS OF INTEREST: YES 

TO BE ADDED: 

END. 

SAVE SET 3. 
FILE SET 3 CREATED. 
END. 

PRINT THE FREQUENCY OF AUTHORS IN SET 3. 
23 AUTHORS IN SET 3. 

4 OKUDA T. 
3 CARLSON R. W. 

END. 

Fig. 8.1. Example of possible user interaction with data 
using retrieval language. 
(Lines typed by user are underlined.) 



8.2 Description of Language 

Two methods of describing the retrieval language have been 
selected. In the first the syntax of the language is described by 
means of a finite state (sequential) machine. In the second the syntax 
and vocabulary are defined by means of Backus normal (ALGOL 60) notation [37]. 
The equivalence of these two descriptions is also shown. 

8.21 Finite State Machine Description 

There are a number of different methods that could be used to 
describe the retrieval language that was developed for this project. 
Perhaps the most appropriate way to describe the syntax of the language 
would be to present the same table that is actually used by the inter- 
pretive part of the retrieval system. Fig. 8.2 is the syntax table 
which has been extracted from a program listing. It is a tabular 
description of a finite state machine. The first column contains the 
identifications of the various states. Column two pertains to one of 
the languages used to write the system (it is the name of a MACRO in FAP) 
and is not pertinent to our discussion here. The third column contains 
the valid state transitions that can occur. For example, the entry 
(V,2) for S1 means that the machine will change from state S1 to S2 if 
the input signal is V (verb). 

    S1     STATE   ((V,2)(X,1)(A,1)) 
    S2     STATE   ((V,2)(C,3)(N,4)(L,8)(E,10)(X,2)(A,2)) 
    S3     STATE   ((V,2)(X,3)(A,3)) 
    S4     STATE   ((N,4)(C,5)(P,6)(X,4)(A,4)) 
    S5     STATE   ((N,4)(X,5)(A,5)) 
    S6     STATE   ((N,7)(X,6)(A,6)) 
    S7     STATE   ((P,6)(L,8)(X,7)(A,7)) 
    S8     STATE   ((L,8)(C,9)(E,10)(X,8)(A,8)) 
    S9     STATE   ((P,6)(L,8)(X,9)(A,9)) 
    S10    STATE 

Fig. 8.2. Finite state machine description of syntax 
of retrieval language. 



Fig. 8.3 is the state diagram for the machine of Fig. 8.2. We have 
left off the self loops on each state due to the X and A inputs to keep 
from cluttering up the diagram. Also not shown is the sink state which 
the machine enters when the input sequence being analyzed has an invalid 
syntax. For example, if the machine is in state S2 and the input signal 
is a P, then the sink state is entered. The initial or starting state 
of the machine is S1. The final or accepting state is S10. Thus an 
input sequence is considered to have an acceptable syntax if it 
transforms the machine of Fig. 8.3 from S1 to S10. 

Fig. 8.3. Finite State Diagram for the Table of Fig. 8.2. 
(Transitions not shown go to an error or sink 
state.) 



The input symbols of Fig. 8.2 and 8.3 represent classes of words. 
Fig. 8.4 gives the general titles and some examples of the classes. The 
interpretive procedure first classifies each word in the input statement 
into one of the classes and then checks the syntax by the Table of 
Fig. 8.2. In Fig. 8.5 we present a specific example of an acceptable 
and an unacceptable statement. 



    Input Symbol   Class Name                   Specific Examples 

    V              Verbs                        print, count 
    N              Nouns                        article, title 
    P              Prepositions                 by, of 
    A              Adjectives and Adverbs       first, last 
    C              Conjunction                  and, or 
    X              Filler Words                 the, a 
    L              Undefined (literal) words    Jones, laser 
    E              Terminator                   . (carriage return) 

Fig. 8.4. Classes of Input Symbols. 

    Statement:         Count the articles by John Jones. 
    Word classes:      V  X  N  P  L  L  E 
    States traversed:  S1 S2 S2 S4 S6 S8 S8 S10 

    Statement:         Print the titles of articles and. 
    Word classes:      V  X  N  P  N  C  E 
    States traversed:  S1 S2 S2 S4 S6 S7 Sink State 

Fig. 8.5. Example of statement with acceptable syntax 
and statement with unacceptable syntax. 
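
The syntax check of Fig. 8.5 can be sketched in a few lines of Python. This is only an illustration, not the FAP routine used by the system. The table below is transcribed from Fig. 8.2 with two assumptions that should be noted: the transition from S6 to S8 on a literal (L) is taken from the accepted example of Fig. 8.5, and S10 is assumed to have no outgoing transitions. The names TRANSITIONS, trace, and is_acceptable are hypothetical, and state 0 stands in for the sink state.

```python
# Transition table: state -> {input class: next state}.
# ASSUMPTION: the "L": 8 entry for state 6 is inferred from the accepted
# example of Fig. 8.5; state 10 (accepting) is assumed to have no exits.
TRANSITIONS = {
    1: {"V": 2, "X": 1, "A": 1},
    2: {"V": 2, "C": 3, "N": 4, "L": 8, "E": 10, "X": 2, "A": 2},
    3: {"V": 2, "X": 3, "A": 3},
    4: {"N": 4, "C": 5, "P": 6, "X": 4, "A": 4},
    5: {"N": 4, "X": 5, "A": 5},
    6: {"N": 7, "L": 8, "X": 6, "A": 6},
    7: {"P": 6, "L": 8, "X": 7, "A": 7},
    8: {"L": 8, "C": 9, "E": 10, "X": 8, "A": 8},
    9: {"P": 6, "L": 8, "X": 9, "A": 9},
    10: {},
}
START, ACCEPT, SINK = 1, 10, 0


def trace(classes):
    """Return the states visited for a sequence of word-class symbols."""
    state = START
    visited = [state]
    for symbol in classes:
        # Any transition missing from the table leads to the sink state.
        state = TRANSITIONS[state].get(symbol, SINK)
        visited.append(state)
        if state == SINK:
            break
    return visited


def is_acceptable(classes):
    """A request is acceptable if it drives the machine from S1 to S10."""
    return trace(classes)[-1] == ACCEPT


print(trace("VXNPLLE"))   # Count the articles by John Jones.
print(trace("VXNPNCE"))   # Print the titles of articles and.
```

A dictionary of dictionaries is used so that every missing entry falls through to the sink state, which mirrors the convention of Fig. 8.3 that transitions not shown are errors.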

Let us comment briefly on the purpose of each state in the diagram 
of Fig. 8.3. Preliminary to doing this it should be noted that there 
are generally three main parts to an acceptable statement (request): 

(1) Verb (states S2 and S3) 

(2) Direct object (states S4 and S5) 

(3) Modifying phrase (states S6 through S9) 

State S1 is the starting state of the machine. State S2 requires that 
each request begin with a verb describing what the system should do. 
The verb can be either simple (e.g. print) or compound (e.g. count and 
save). State S3 excludes the possibility of a double conjunction 
between elements of a compound verb (e.g. print and or store). It also 
prevents the verb from ending in a conjunction. 

State S4 requires that the next part of a request be a list of one 
or more nouns signifying the type of information that is to be produced 
by the system. This can again be simple (e.g. title) or compound (e.g. 
title, authors, and locations). State S5 has a purpose similar to S3. 

The last part of the request is the modifying phrase which 
contains the structure of the articles and other entities that are 
specified by the user in making the request. States S6 and S7 allow 
the request to have a complex structure with several levels of 
prepositional phrases modifying other phrases. For example, one could find 
the co-authors of a given author by the request: "Find the authors of 
articles by John Jones." 

States S8 and S9 allow the user to specify some logical combination 
of a number of specific fields. For example: "Print the articles by 
John Jones and Robert Smith but not Joseph Adams." 

The E transition from S2 to S10 is so that certain commands will be 
accepted that consist of a verb only. The LE transition between S2 and 
S10 allows for an abbreviated mode of reference to certain data (e.g. 
Print set 3.). Adjectives and adverbs can occur anywhere in a request 
and can modify verbs, nouns, etc. 

8.22 Backus Normal Description 

Let us leave the finite state description of the syntax of the 
language now and provide a more conventional description. The statements 
of Figs. 8.6-8.8 constitute the Backus normal (ALGOL 60) description of 
the language. In this notation "::=" means "is defined to be", "|" 
means "or", and "< >" encloses the defined elements of the language. 

Two additional explanations are necessary for the Backus normal 
description of Figs. 8.6-8.8. All elements (words) in the statements are 
separated by one or more word separators (blanks, commas or periods) 
except in the definitions for <word> and <integer> where the characters 
have no separation. Adjectives, adverbs, and filler words can occur at 
any point in a request, but this fact is omitted from the description to 
simplify its statement. 

<request> ::= <compound verb> <compound object> <compound modifier> 
              <terminator> | <abbreviated command> 

<compound verb> ::= <verb> | <compound verb> <verb> | 
                    <compound verb> <conjunction> <verb> 

<compound object> ::= <noun> | <compound object> <noun> | 
                      <compound object> <conjunction> <noun> 

<compound modifier> ::= <modifying phrase> | <compound modifier> 
                        <conjunction> <modifying phrase> 

<modifying phrase> ::= <preposition> <compound literal> | 
                       <preposition> <noun> <modifying phrase> 

<compound literal> ::= <literal> | <compound literal> <conjunction> 
                       <literal> | <compound literal> <literal> 

<abbreviated command> ::= <compound verb> <terminator> | 
                          <compound verb> <literal> <terminator> 

Fig. 8.6. Backus normal statements describing syntax 
of language. 

<vocabulary word> ::= <verb> | <conjunction> | <noun> | <preposition> | 
                      <adjective> | <adverb> | <filler> | <terminator> 

<verb> ::= <find verb> | <print verb> | <delete verb> | <save verb> | 
           <read verb> | <other verb> 

<find verb> ::= count | find | fetch | f | get | g | keep 

<print verb> ::= list | print | p 

<delete verb> ::= delete 

<save verb> ::= dump | save | store 

<read verb> ::= read 

<other verb> ::= load | return | search | trace | unload | yes | no | skip 

<conjunction> ::= and | and not | but not | not | or 

<noun> ::= <article noun> | <title noun> | <word noun> | <author noun> | 
           <location noun> | <citation noun> 

<article noun> ::= art | article | articles | doc | document | documents | 
                   id | ids | identification | identifications | paper | 
                   papers 

<word noun> ::= keyword | keywords | word | words 

<author noun> ::= aut | author | authors 

<location noun> ::= loc | location | locations 

<citation noun> ::= biblio | bibliography | bibliographies | cit | citation | 
                    citations | ref | reference | references 

<preposition> ::= <article preposition> | <word preposition> | 
                  <author preposition> | <location preposition> | 
                  <citing preposition> | <cited by preposition> | 
                  <set preposition> | <clustering preposition> 

<article preposition> ::= of | used by 

<word preposition> ::= contain | contains | containing | use | using 

<author preposition> ::= by 

<location preposition> ::= at 

<citing preposition> ::= cite | citing 

<cited by preposition> ::= cited by 

<set preposition> ::= in 

<clustering preposition> ::= related to | related by authors to | 
                             related by citations to 

<filler> ::= a | all | all of | an | any | any of | are | been | each | every | 
             have | is | the | this | these | those | were | written 

<adjective> ::= first | last | most recent 

<adverb> ::= by frequency | for decision 

<terminator> ::= . (cr)      ((cr) is a carriage return) 

Fig. 8.7. Backus normal statements describing vocabulary of language. 

<literal> ::= <article literal> | <word literal> | <author literal> | 
              <location literal> | <set literal> 

<article literal> ::= <journal> <volume> <page> 

<word literal> ::= <literal string> 

<author literal> ::= <literal string> 

<location literal> ::= <literal string> 

<set literal> ::= set <integer> 

<journal> ::= <journal name> | <alphabetic code> | <numeric code> 

<journal name> ::= Phys. Rev. | Physical Review | ... | Physics of Fluids 

<alphabetic code> ::= phyrev | phyreb | ... | spjetp 

<numeric code> ::= <integer> 

<volume> ::= <word> <integer> | <integer> 

<page> ::= <word> <integer> | <integer> 

<literal string> ::= <word string> | '<word string>' 

    (the first word string in this definition cannot include a 
    vocabulary word.) 

<word string> ::= <word> | <word string> <word> 

<word> ::= <character> | <character><character> | 
           <character><character><character> | ... 

<integer> ::= <digit> | <digit><digit> | <digit><digit><digit> | ... 

<character> ::= <letter> | <digit> | <special character> 

<letter> ::= a | b | ... | z 

<digit> ::= 0 | 1 | ... | 9 

<special character> ::= - | / | * | [remaining characters illegible] 

<word separator> ::= (blank) | , | . 

Fig. 8.8. Backus normal description of literals. 

8.23 Equivalence of Descriptions 

The equivalence of the Backus normal definition of Sec. 8.22 to 
the finite state diagram of Sec. 8.21 can be shown by successively 
applying the four transformations of Fig. 8.9 to the statements of 
Fig. 8.6. Fig. 8.10 is a brief outline of the steps which would be 
taken in this process. One is referred to the literature for an 
explanation of the additional concepts (e.g. non-deterministic machines, 
equivalent states, etc.) introduced in this Figure. 



         Backus Normal        Finite State 

    (1)  A ::= B | C          [diagram] 
    (2)  A ::= B C            [diagram] 
    (3)  A ::= A B | C        [diagram] 
    (4)  A ::= B A | C        [diagram] 

Fig. 8.9. Rules for transforming Backus normal statements 
to finite state diagram. 



8.3 Interpretive Algorithm 

In this section we will describe how the retrieval system inter- 
prets and processes the language of Sec. 8.2. The discussion will 
initially cover some general aspects of requests and of the words that 
they contain. Sections 8.32-8.34 will describe the various functions 
that requests can perform (the verb), the types of data that can be 
generated as output (the direct object), and the structure that 
specifies the actual request (the modifying phrase). 

    Expansion of R. Rules 1 and 2. 
    Reduction to deterministic equivalent (2a, 2b combined). 
    Expansion of (CM). 
    Rule 3. 
    Rule 4. 
    Rule 3. 
    Reduction to deterministic machine: 6e = (6a, 6b); 6f = (6c, 6d); 
        7a = (9a). 
    Substitution for (CM). (Null symbol X necessary for N isolation.) 
    Reduction to deterministic machine: 7b = (4b). 
    Combination of equivalent states: (6e, 6f); (7a, 7b). 

Fig. 8.10. Outline of steps proving equivalence of Backus-normal 
and finite state descriptions. 

8.31 Vocabulary and Literals 

A request consists of one or more lines of characters that the user 
types on his time-sharing console. The maximum length of a request is 
currently 400 characters. The end of a request is indicated by a period 
followed by a carriage return. The request character string is initially 
broken up into words. Words are defined to be character strings 
separated by blanks, commas, and/or periods. There are two types of 
words: those found in the vocabulary table and those not found in the 
table. All words not found in the table are called literals. Their 
function is to specify the particular authors, title words, citations, 
etc. that the user wishes to designate in defining his request. The 
vocabulary words are for indicating the function and structure of the 
request. 

In some cases a user may want to use one of the words in the 
vocabulary table as a literal. For example, he may want to find all 
titles that contain the vocabulary word, "store". To do this he can 
explicitly specify the word as a literal by the use of the literal mark, 
" ' ". For the above example the user would say, "print the titles of 
all articles containing 'store' ." 
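
The word-splitting and classification step just described might be sketched as follows. This is an illustrative Python fragment under stated assumptions, not the system's own procedure: the VOCABULARY table is a small hypothetical excerpt, multi-word vocabulary entries such as "cited by" are not handled, and the terminator is simply dropped instead of being reported as class E.

```python
import re

# Hypothetical excerpt of the vocabulary table: word -> class symbol of Fig. 8.4.
VOCABULARY = {
    "print": "V", "find": "V", "store": "V",
    "title": "N", "titles": "N", "article": "N", "articles": "N",
    "of": "P", "by": "P", "containing": "P",
    "and": "C", "or": "C",
    "the": "X", "all": "X",
}

# A word is either text enclosed in literal marks, or a run of characters
# that contains no blank, comma, period, or literal mark.
WORD = re.compile(r"'([^']*)'|([^\s,.']+)")


def classify(request):
    """Split a request into words and return (word, class) pairs."""
    pairs = []
    # Case is folded first, since the system ignores letter case.
    for match in WORD.finditer(request.lower()):
        quoted, plain = match.group(1), match.group(2)
        if quoted is not None:
            # The literal mark forces class L, even for a vocabulary word.
            pairs.append((quoted.strip(), "L"))
        else:
            # Any word not found in the table is a literal.
            pairs.append((plain, VOCABULARY.get(plain, "L")))
    return pairs


print(classify("Print the titles of all articles containing 'store' ."))
```

Without the literal marks the word "store" would be classified as a verb, which is the case the literal mark is meant to resolve.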

Note that the retrieval system makes no distinction between lower 
and uppercase letters. The T.I.P. file does not contain information on 
whether a letter is lower or upper case either. 

8.32 Available Functions 

The verb part of each request specifies the particular operation or 
operations that are to be performed. For example, if the user wants the 
results of the search to be printed on his time-sharing console, he 
would use the verb, "print". There are currently twenty-three verbs in 
the vocabulary and thirteen different functions that they specify. Let 
us describe five of the thirteen functions. 

(1) Scratchpad Storage 

One of the most useful features of the retrieval system is its 
scratchpad storage capability. Basically this involves the storage in 
core memory of various kinds of data for later reference. For example, 
one can create in scratchpad storage a file of all articles written by a 
given author by the command, "Find the articles by John Jones." After 
creating the set, the system tells the user its size and identification 
number (e.g. 4 articles in set 3). Later on the user could find out 
what articles cite articles by John Jones by the request, "Print the 
articles citing articles in set 3," or just "p art citing set 3." 

Each data set in scratchpad storage is currently homogeneous with 
respect to the type of information it contains. In other words one 
could not create a set that consisted of both author and citation data. 

Some of the verbs that create sets in scratchpad storage are: 
count, find, fetch, f, get, g, and keep. These words are completely 
equivalent so far as the system is concerned. 
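
A minimal sketch of such a scratchpad follows, in Python. The class and method names are hypothetical, and the assumption that sets are numbered consecutively from 1 is made only for illustration; the actual core-memory layout is the one discussed in Chapter VII.

```python
class Scratchpad:
    """Numbered data sets held in memory, each of a single data type."""

    def __init__(self):
        self.sets = {}          # set number -> (data type, list of items)
        self.next_number = 1    # ASSUMPTION: sets are numbered from 1 upward

    def create(self, data_type, items):
        """Store a homogeneous set and report its size and number."""
        items = list(items)
        number = self.next_number
        self.next_number += 1
        self.sets[number] = (data_type, items)
        return number, f"{len(items)} {data_type} in set {number}."

    def fetch(self, number, expected_type=None):
        """Return the items of a set, optionally checking its data type."""
        data_type, items = self.sets[number]
        if expected_type is not None and data_type != expected_type:
            raise TypeError(f"set {number} holds {data_type}")
        return items


pad = Scratchpad()
number, message = pad.create("articles", ["a1", "a2", "a3", "a4"])
print(message)
```

Keeping the data type alongside the items is one simple way to enforce the rule that a set cannot mix, say, author and citation data.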

(2) Console Print-out 

The verbs that will cause the data in question to be printed on the 
user's console are list, print, and p. A scratchpad set will also be 
automatically created (if the output is homogeneous and if it isn't 
already a set). 

The first line of each print-out consists of the number of items 
that will follow. Thus the user is always aware of the ultimate size of 
the listing and can interrupt it if he wishes. 

(3) Delete Data Sets 

Sets or groups of sets can be erased from scratchpad storage by 
commands such as "Delete set 4." and "Delete all sets." 

(4) Save Data Sets 

Any scratchpad data set can be placed on the disc for permanent 
storage by the verbs save, store, or dump. The form of the command 
would be: "Save set 2." 

(5) Read Data Sets 

Data sets that have been stored on the disc by the above command 
can be written back into scratchpad storage by commands of the type: 
"Read set 6." 

The functions of some of the verbs can be modified by adverbs or 
adverbial phrases. Let us describe two such modifications that have 
been implemented. 

(1) Frequency Lists 

The print verb can be modified to list items in terms of their 
frequency of occurrence in the data from which they are extracted. For 
example, the command, "Print frequency of title words in Phys. Rev. 
Vol. 132." would produce a list of the number of times each word appears 
in the titles of articles in Phys. Rev. Vol. 132 (most frequent first 
and alphabetical within the same frequency). 
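
The ordering rule for frequency lists (most frequent first, alphabetical within the same frequency) can be illustrated with a short Python sketch. The function name is hypothetical, titles are split on blanks only, and no attempt is made to reproduce how the system treats punctuation or common words.

```python
from collections import Counter


def frequency_list(titles):
    """Return (count, word) pairs, most frequent first, then alphabetical."""
    counts = Counter(word for title in titles for word in title.lower().split())
    pairs = [(count, word) for word, count in counts.items()]
    # Negating the count sorts by descending frequency while the word
    # itself breaks ties in ascending alphabetical order.
    return sorted(pairs, key=lambda pair: (-pair[0], pair[1]))


for count, word in frequency_list(["Plasma probe", "Probe theory", "plasma probe"]):
    print(count, word)
```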

(2) Decision Print-outs 

The print verb can also be modified so that there is a pause after 
each item is printed out to allow the user to decide upon and respond to 
the item. This would be the command used, for example, by a user who 
wished to be coupled into the clustering procedure. For the command, 
"Print for decision the titles of articles related to Nuovo Cimento 
Vol. 30, page 1.", the procedure would pause after printing the title of 
each article about to be added to or deleted from the set S and allow 
the user to place the article in the Y or Z set if he wished. 

8.33 Data Generated 

The second part of the request is the direct object of the verb. 
It is a list of the types of information (nouns) that the user specifies 
he wants in the system's response to the request. Fig. 8.7 indicates 
six different types of nouns that can be used for this purpose (article, 
title, word, author, location, and citation nouns). The correspondence 
of these words to the various types of data found in the T.I.P. file is 
fairly obvious. Any combination of these types of data can be printed 
on the user's console, but only one type can be put in scratchpad 
storage for a given request. The form of the data as it is printed on 
the console is shown in Fig. 6.1i. The data placed in scratchpad has the 
single level structure indicated by Fig. 8.11 (see Sec. 7.1). 

    Set Node: 
        Author Name Nodes: 

Fig. 8.11. File structure of data in scratchpad storage. 

8.34 Request Structure 

The third and final component of the request is the phrase which 
modifies the direct object of the verb. It consists of a series of 
prepositional phrases which either modify the direct object itself or 
else modify the noun object of one of the other prepositional phrases. 
Let us define the structure of this modifying phrase and describe how it 
is interpreted. 

8.341 Determination of Literal Type 

The object of each preposition can be a noun or a literal. In the 
case of a literal some indication must be given of its type, since there 
is no intrinsic difference between most of the types (e.g. a word 
literal might look exactly like an author literal). The first preposi- 
tion to the left of a literal is currently used to determine the type. 
Fig. 8.12 lists the literal type which is assumed to follow each preposi- 
tion. For example, any word not in the vocabulary that follows the 
preposition, "by", is assumed to be an author's name. 

The one exception to this is the set literal which can be the 
object of any preposition. It is distinguished from other literals, not 
by the preceding preposition, but by the word, "set", at the beginning 
of the literal. 
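
The rule can be sketched in Python as a table look-up with the set literal as the single special case. The mapping below follows the preposition classes of Fig. 8.7 and the object types of Fig. 8.12 for a few representative prepositions; the names LITERAL_TYPE and literal_type are hypothetical.

```python
# Literal type assumed to follow each preposition (representative entries).
LITERAL_TYPE = {
    "of": "article", "used by": "article",
    "containing": "word", "using": "word",
    "by": "author",
    "at": "location",
    "citing": "article", "cited by": "article",
    "in": "set",
    "related to": "article",
}


def literal_type(preposition, literal_words):
    """Decide the type of a literal from the preposition to its left."""
    # A literal that begins with the word "set" is a set literal
    # no matter which preposition precedes it.
    if literal_words and literal_words[0].lower() == "set":
        return "set"
    return LITERAL_TYPE[preposition.lower()]


print(literal_type("by", ["John", "Jones"]))
print(literal_type("citing", ["set", "3"]))
```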

There is one additional way of indicating the literal type which has 
been partially implemented but is not described in Sec. 8.2. This 
involves the use of a noun between the preposition and the literal. An 
example of this would be the phrase, "with the word, phonon", which is 
acceptable and identical to the phrase, "using phonon". A change such as 
this would become essential if the number of data types increased 
substantially, since there would not be enough suitable prepositions. 

    Preposition Type            Type of Object 

    <article preposition>       <article noun>, <citation noun>, <article literal> 
    <word preposition>          <word noun>, <word literal> 
    <author preposition>        <author noun>, <author literal> 
    <location preposition>      <location noun>, <location literal> 
    <citing preposition>        <article noun>, <citation noun>, <article literal> 
    <cited by preposition>      <article noun>, <citation noun>, <article literal> 
    <set preposition>           <set literal> 
    <clustering preposition>    <article noun>, <citation noun>, <article literal> 

Fig. 8.12. Valid types of objects for each preposition class. 
(Set literals are valid objects for any preposition 
and are not listed.) 

8.342 Form of Literals 

After the general type of information that a literal contains is 
determined, one must next interpret what specifically is meant by each 
literal. To this end let us describe the conventions which govern the 
form that each type of literal can take. 

Article literals generally consist of three parts: the journal, 
volume, and page. The journal can be specified by using the full title, 
the standard abbreviation of the title, or a special alphabetic or 
numeric code. The volume and page number can each consist of an integer 
or a word followed by an integer. Some examples of acceptable article 
literals are: 

    Physical Review, volume 128, page 1 
    Phys. Rev., vol. 128, p. 1 
    Phyrev v 128 p 1 
    1 128 1 

The volume and page number have been made optional so that one can 
refer to all articles in a given journal or in a given volume by a 
single literal. 
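
As an illustration, the four equivalent forms above can be reduced to a common (journal, volume, page) triple by a sketch such as the following. The JOURNALS table is a hypothetical excerpt (in particular, treating "1" and "phyrev" as codes for Physical Review is inferred from the examples above), and the function name is likewise an assumption.

```python
import re

# Hypothetical excerpt of the journal table: name or code words -> journal code.
JOURNALS = {
    ("physical", "review"): "phyrev",
    ("phys", "rev"): "phyrev",
    ("phyrev",): "phyrev",
    ("1",): "phyrev",
}


def parse_article_literal(text):
    """Return (journal, volume, page); volume and page may be None."""
    words = [w for w in re.split(r"[\s,.]+", text.lower()) if w]
    # Take the longest leading run of words that names a known journal.
    for length in range(len(words), 0, -1):
        journal = JOURNALS.get(tuple(words[:length]))
        if journal is not None:
            break
    else:
        raise ValueError("unknown journal")
    numbers, rest, i = [], words[length:], 0
    while i < len(rest):
        if rest[i].isdigit():                       # <integer>
            numbers.append(int(rest[i]))
            i += 1
        elif i + 1 < len(rest) and rest[i + 1].isdigit():
            numbers.append(int(rest[i + 1]))        # <word> <integer>
            i += 2
        else:
            raise ValueError("malformed volume or page")
    if len(numbers) > 2:
        raise ValueError("too many numbers")
    volume, page = (numbers + [None, None])[:2]
    return journal, volume, page


print(parse_article_literal("Phys. Rev., vol. 128, p. 1"))
```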

Each word literal should consist of a single word. If one wishes 
to search for a phrase of two or more words, he should use two or more 
literals (e.g. "print titles of articles using thin and film."). 

A word literal represents (matches) not only the word in the file 
which is identical to it, but also all words to which it is the prefix. 
Thus the command, "Get the art using supercon." would get all articles 
with titles containing superconductor, superconductivity, etc. 

If one does not want prefix matching, he can use a "*" to designate 
an explicit blank. The command, "p art using laser*.", would not 
produce those articles whose titles contain the word, "lasers". 
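
The two matching conventions for word literals can be stated compactly in Python. This is a sketch of the rule only; the function name is hypothetical and says nothing about how the inverted word file is actually searched.

```python
def word_matches(literal, file_word):
    """True if a word literal matches a word found in the file."""
    literal, file_word = literal.lower(), file_word.lower()
    if literal.endswith("*"):
        # A trailing "*" is an explicit blank: only the exact word matches.
        return file_word == literal[:-1]
    # Otherwise the literal matches every word of which it is a prefix.
    return file_word.startswith(literal)


print(word_matches("supercon", "superconductivity"))
print(word_matches("laser*", "lasers"))
```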

Author literals are to be written with the surname last (e.g. 
John H. Jones). A literal that consists of a surname only will retrieve 
all authors with that surname. A literal containing one or more given 
names will match those author names in the file for which the surname 
matches exactly and for which every given name in the literal is the 
prefix of the corresponding given name in the file. Thus, "p art by Al 
Jones.", would print all articles by "Albert Jones," "Alden Jones", 
and "Allen S. Jones". 
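
The author matching rule can be sketched as follows. The sketch assumes that each author name in the file is available as a surname plus a list of given names, and that "corresponding" given names are paired by position; both are assumptions made for illustration, as is the function name.

```python
def author_matches(literal, surname, given_names):
    """True if an author literal (surname last) matches a name in the file."""
    parts = literal.lower().replace(".", " ").split()
    if not parts:
        return False
    *literal_given, literal_surname = parts
    # The surname must match exactly.
    if literal_surname != surname.lower():
        return False
    file_given = [name.lower().rstrip(".") for name in given_names]
    if len(literal_given) > len(file_given):
        return False
    # Every given name in the literal must be a prefix of the given name
    # in the same position in the file (ASSUMED pairing by position).
    return all(f.startswith(l) for l, f in zip(literal_given, file_given))


print(author_matches("Al Jones", "Jones", ["Allen", "S."]))
print(author_matches("Al Jones", "Jones", ["Robert"]))
```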

Location literals must be given in a request exactly as they are 
found in the data file if retrieval is to be accomplished. 

Set literals consist of the word, "set", followed by the 
identification number of the desired set. 

8.343 Action Initiated by Each Preposition 

Each prepositional phrase in a request initiates a file search 
(table look-up) in an appropriate data file. If the object of the 
preposition is an author, location, word, or citation literal, then the 
file used is the corresponding inverted file. If the object of the 
phrase is an article literal then the raw data file is used. 

The information obtained from an inverted file is, of course, 
always a list of article identifications. The type of information 
obtained from the raw data file is determined by the type of noun that 
is modified by the prepositional phrase in question. For example, in 
the command, "Print authors of Phys. Rev. 128 1.", the table look-up 
for the "of" preposition would be in the raw data file and would select 
the author information. 

The set of articles (or other data) produced by each table look-up 
can in turn be the object of another preposition and another table look- 
up. Consider the request, "Print the titles of articles cited by 
articles by John Jones." The procedure first looks up the articles by 
John Jones. Then it finds the articles cited by the articles by John 
Jones. And finally it retrieves and prints the titles of the articles 
so obtained. Note that each of the three prepositions, of, (cited) by, 
and by initiated a particular type of file search. 
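
The chain of look-ups for this request can be sketched in Python with two miniature files. The file contents, the dictionary layout, and the function names are all hypothetical; they stand in for the inverted author file and the raw data file described in the text.

```python
# Hypothetical inverted file: author -> article identifications.
BY_AUTHOR = {"john jones": {"a1", "a2"}}

# Hypothetical raw data file: article identification -> its data.
RAW = {
    "a1": {"title": "Langmuir probes", "cites": ["b1", "b2"]},
    "a2": {"title": "Probe dynamics", "cites": ["b2"]},
    "b1": {"title": "Probe theory", "cites": []},
    "b2": {"title": "Plasma waves", "cites": []},
}


def articles_by(author):
    """The 'by' look-up: inverted author file."""
    return set(BY_AUTHOR.get(author.lower(), ()))


def cited_by(articles):
    """The 'cited by' look-up: citations of each article in the raw file."""
    return {cited for article in articles for cited in RAW[article]["cites"]}


def titles_of(articles):
    """The 'of' look-up: title information from the raw file."""
    return sorted(RAW[article]["title"] for article in articles)


# "Print the titles of articles cited by articles by John Jones."
# The innermost (rightmost) phrase is evaluated first.
print(titles_of(cited_by(articles_by("John Jones"))))
```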

There are two types of prepositions that do not cause a table look- 
up in a file. A clustering preposition performs more than just a table 
look-up. The procedure of Chapter V is executed, resulting in the set 
of articles of the appropriate cluster. 

The set preposition does not initiate a file search but produces 
the input set as its output (a unitary transformation). Thus in the 
request, "Print the title of articles in set 4", the preposition, "in", 
merely passes on the articles in set 4 to the next preposition, "of", 
which looks up their titles. 

8.344 Logical Operations 

The results of the table look-ups (or clustering) for two or more 
prepositional phrases can be combined by the standard logical operations 
(and, or, not). Consider, for example, the request, "Print the articles 
by John Jones and by Robert Smith or by Charles White but not by David 
Allen." The logical operation performed can be represented by the 
equation $((J.J. \cap R.S.) \cup C.W.) \cap \overline{D.A.}$, where the 
initials J.J. stand for the set of papers by John Jones and 
$\overline{D.A.}$ is the set of papers not written by David Allen. It 
will be noted that the logical operations are performed from left to 
right through the request in the same sequence in which the user typed 
them in. It was thought that this might be a more useful convention for 
a system that is closely coupled to the user than to have a 
parenthesized system with a hierarchy of the types of operations to 
perform first (as in MAD, FORTRAN, etc.). 
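
The left-to-right convention amounts to folding the operators over the sets in the order typed. A minimal Python sketch follows; the function name and the small integer sets are hypothetical, and the negating conjunctions are all treated as set difference.

```python
def combine(first, *steps):
    """Apply (operator, operand) steps strictly from left to right."""
    result = set(first)
    for operator, operand in steps:
        if operator == "and":
            result &= set(operand)
        elif operator == "or":
            result |= set(operand)
        elif operator in ("but not", "and not", "not"):
            result -= set(operand)
        else:
            raise ValueError(f"unknown conjunction: {operator}")
    return result


jj, rs, cw, da = {1, 2, 3}, {2, 3, 4}, {3, 5}, {3, 5}

# "... by John Jones and by Robert Smith or by Charles White but not by
# David Allen" is evaluated as ((J.J. and R.S.) or C.W.) but not D.A.
print(combine(jj, ("and", rs), ("or", cw), ("but not", da)))
```

No operator has precedence over another here; the result depends only on the order in which the phrases were typed.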

Any arbitrarily complex logical structure can be obtained by this 
kind of approach (without having to use parentheses) if one creates sets 
in scratchpad storage. For example the set of articles represented by 
the logical expression, 
$(J.J. \cap R.S.) \cup (C.W. \cap \overline{D.A.})$, could be created by 
the sequence of commands: 

    Find art by John Jones and by Robert Smith. 
    3 articles in set 1. 
    Find art by Charles White but not by David Allen. 
    1 article in set 2. 
    Print art in set 1 or in set 2. 

There is one logical structure that is not allowed in the system 
since it makes little sense in retrieval applications. This is the 
negation of any of the operands of the "or" operation. Consider the 
command, "Print articles by John Jones or not by Robert Smith." If 
this means $J.J. \cup \overline{R.S.}$, then the articles requested would 
include most of the file since Robert Smith would have authored at most 
20-30 articles. 

The conjunctive operation between each pair of prepositional 
phrases must be explicitly stated. One could not say, "Print art by 
John Jones, by Robert Smith, and by Charles White." However, one can 
omit the prepositions after the first one (e.g. "Print art by John Jones 
and Robert Smith."). 

8.345 Selection of Predecessor 

The next problem to be considered is the determination of what 
noun(s) each prepositional phrase modifies (its predecessor). Consider 
the request, "Find the articles citing articles by John Jones and cited 
by Physics of Fluids, v. 7, p. 1." The last phrase, "cited by..." can 
conceivably modify either of the two preceding "articles" words. 
However, the answer to the request is markedly different depending on 
the interpretation selected. The approach adopted here is to "attach" 
each prepositional phrase to the first noun to the left of the phrase 
that is a valid type for the preposition in question. In Fig. 8.13 the 
valid noun types that can be modified by each preposition are listed. 

Note that each preposition that immediately follows a noun and not 
a conjunction, must modify that noun and cannot be attached to other 
nouns further to the left. If the noun is not valid for the preposition 
by Fig. 8.13, then the request is considered in error. The request, 
"Find the articles by John Jones and the citations at Harvard University",
would not be valid because the preposition "at" is not a valid modifier
of "citations" and cannot be attached to the earlier "articles" word
because it does not immediately follow a conjunction.



Preposition Type              Modifiable Noun Types

<article preposition>         <noun>
<word preposition>            <article noun>, <citation noun>
<author preposition>          <article noun>, <citation noun>
<location preposition>        <article noun>, <citation noun>
<citing preposition>          <article noun>, <citation noun>
<cited by preposition>        <article noun>, <citation noun>
<set preposition>             <noun>
<clustering preposition>      <article noun>, <citation noun>

Fig. 8.13. Types of nouns that each class of prepositions
can modify.
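
The attachment rule of this section can be sketched in a few lines. The
code below is an illustration only: the token representation, the name
find_predecessor, and the small validity table are assumptions, and the
table is deliberately not a transcription of Fig. 8.13.

```python
# Illustrative validity table; NOT a transcription of Fig. 8.13.
VALID_NOUNS = {
    "by": {"article", "citation"},
    "citing": {"article", "citation"},
    "cited by": {"article", "citation"},
    "at": {"article"},
}


def find_predecessor(tokens, index):
    """Return the position of the noun modified by the preposition at `index`.

    `tokens` is a list of (kind, value) pairs, kind being "noun", "prep"
    or "conj". Objects of prepositions (names, journals) are left out.
    """
    kind, preposition = tokens[index]
    if kind != "prep":
        raise ValueError("token at index is not a preposition")
    valid = VALID_NOUNS[preposition]

    # A preposition directly after a noun must modify that noun.
    if index > 0 and tokens[index - 1][0] == "noun":
        if tokens[index - 1][1] in valid:
            return index - 1
        raise ValueError(f"{preposition!r} cannot modify {tokens[index - 1][1]!r}")

    # Otherwise attach to the first valid noun to the left.
    for position in range(index - 1, -1, -1):
        token_kind, value = tokens[position]
        if token_kind == "noun" and value in valid:
            return position
    raise ValueError(f"no valid noun found for {preposition!r}")


# "Find the articles citing articles by John Jones and cited by ..."
request = [("noun", "article"), ("prep", "citing"), ("noun", "article"),
           ("prep", "by"), ("conj", "and"), ("prep", "cited by")]
print(find_predecessor(request, 5))
```

In this sketch the final "cited by" phrase follows a conjunction, so it
is attached to the nearest valid noun to its left, which is the second
"articles" word. A preposition that directly follows an invalid noun
raises an error instead of searching further left.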



8.346 Interpretation of Adjectives

Let us make two final comments concerning the interpretation of the
language. Filler words are adjectives, adverbs and certain other words
that initiate no action in the interpreter. They are effectively ignored.
Their only use is to make the statement of the request smoother and more
natural.

There are other adjectives and adverbs that do affect the interpreter,
however. Some of them are listed in Fig. 8.7. A large number of
adjectives and adverbs come to mind that would be very useful if implemented.
However, only enough of them were made part of the experimental
system that the possibility of their use in the language could be tested.

PART FOUR: RESULTS AND CONCLUSIONS

Part Two introduced a theoretical model for a
document retrieval system. The experimental system
developed to test the model in a realistic environment
was described in Part Three. In this part we
present the experimental results obtained with the
system and the conclusions about the model that can
be drawn from them.

This final part is divided into two chapters. 

Chapter IX: Experimental Results 
Chapter X: Conclusions 

CHAPTER IX
EXPERIMENTAL RESULTS 

In the first section of this chapter some data on the general
characteristics of clusters will be presented. Then some specific
examples will be given illustrating the composition of clusters in
terms of the frequency of occurrence of title words, authors, and
citations of the included articles.

In the next two sections clusters will be compared with some
existing sets of documents which have already been judged to be
mutually pertinent. Three bibliographies found in review articles that
are not part of the T.I.P. file and two subject categories compiled by
indexers will be used for this purpose.

Finally, the results of two tests will be presented in which
clusters were evaluated by representative users of the document file.

9.1 Cluster Parameters 

Before attacking the problem of whether or not clusters contain 
sets of documents that are mutually interesting to users, it may be 
appropriate to first summarize some of the more general features of 
clusters. This section will, accordingly, present statistics on certain 
cluster parameters. 

The data from which the statistics are drawn come from the tests of
Secs. 9.3 to 9.5. They are, of course, a function of the particular
requests presented to the system during the tests and of the composition
of the T.I.P. file at the time. It was thought, however, that this
summary would serve as an introduction to the experimental results.

The first parameter that will be described is cluster size. Fig.
9.1 shows the distribution by size of the clusters generated
by the procedure. The largest cluster found so far contains 159 documents,
while the smallest contains only one document.

[Figure: distribution over the cluster-size ranges 1-20, 21-40, 41-60,
61-80, 81-100, 101-120, and 121-up documents.]

Fig. 9.1. Distribution of cluster size for 490 clusters.



One of the important features of the clustering procedure as
described in Chapter V is its ability to adjust the size of the answer
to fit the request. This is accomplished by applying a bias to the
links of the document network (see Sec. 4.4). About 82% of the clusters
examined utilized either a positive or negative bias, with the other 18%
having no (zero) bias.

In Fig. 9.2 the distribution of clusters for various ranges of bias
is shown. Fig. 9.3 indicates that the average cluster size increases 
monotonically as the bias increases. This curve seems to follow the 
equation y = 80(x-12) where y is the cluster size and x is the bias. We
will not attempt to explain why this is the case here. 

Fig. 9.2. Distribution of clusters by bias for 275 clusters.

[Figure: plot of average cluster size (vertical axis) against bias in
bits (horizontal axis, -20 to 100).]

Fig. 9.3. Plot of average cluster size versus bias for 340 clusters.



Another characteristic of the procedure that can be studied is the 
way documents are deleted from the set (S) that is being formed. The 
formation of 37 clusters was observed. It was found that an average of 
three documents were deleted per cluster. This resulted in an average 
deletion of one document in every l£ iterations. It was also found that 
about 90% of the documents that were deleted from S were added to S
again at some later time during the clustering.

Let us next ask when during the clustering process deletions occur.
Fig. 9.4 indicates that deletions are more likely to occur toward the
end of the clustering process.

Fig. 9.4. Percent of deletions occurring in each quartile of
the clustering process (average for 75 clusters).



In the final portion of this section we will describe the way the
procedure responds to requests that are inconsistent or ambiguous. A
specific example (Cluster A of Sec. 9.33) is used for this purpose.
The first test consisted of holding the pertinent (Y) set of the request
constant and successively placing every other member of Cluster A
in the non-pertinent (Z) set ($Y = a_1$; $Z = a_i$, $i = 1, \ldots, n$). The results are
shown in Figs. 9.5 and 9.6.

There are three basic types of responses that resulted. In seven
cases the size of the cluster was reduced. This was, in general, what
happened when the document specified as not pertinent had a smaller bias
to A than $a_1$ did. In eight other cases the procedure was found to
select another cluster (B, D, or E) containing some documents that were
not part of the original cluster. In the remaining twelve cases the
request was judged to be inconsistent. A careful examination of the
network revealed that in each of the twelve cases there was at least
one cluster which could have satisfied the request. The reasons why
the procedure was not able to locate a valid answer cluster in these
cases have already been discussed in Sec. 5.51.

Figs. 9.5 and 9.6 illustrate two types of request ambiguity. The
first type is hierarchical in nature, involving clusters that are subsets
of larger clusters. Take, for example, the request Y=a; Z=a 1 ^'. It
can be satisfied not only by the cluster listed for it in Fig. 9.5, but
also by the smaller clusters listed for a„, a 1Q , and a^ Q . The second
type of ambiguity is due to the fact that clusters overlap. Thus the
clusters B, D, or E also satisfy the request Y=a ; Z=a,Q.
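
Both kinds of ambiguity can be made concrete with a small sketch. It
assumes, as the discussion above suggests, that a cluster satisfies a
request when it contains every document in Y and no document in Z. The
function name and the cluster contents are hypothetical.

```python
def satisfying_clusters(clusters, pertinent, non_pertinent):
    """Names of clusters that contain all of Y and none of Z."""
    return [
        name
        for name, members in clusters.items()
        if pertinent <= members and not (non_pertinent & members)
    ]


# Hypothetical clusters (document labels are illustrative).
clusters = {
    "A": {"a1", "a2", "a3", "a4"},
    "A_small": {"a1", "a2"},      # subset of A: hierarchical ambiguity
    "B": {"a1", "a5", "a6"},      # overlaps A: overlap ambiguity
}

answers = satisfying_clusters(clusters, {"a1"}, {"a4"})
print(answers)
```

A request is ambiguous in this sense whenever more than one name is
returned; here a subset of A and an overlapping cluster both qualify
once a4 is declared non-pertinent.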

A second test was conducted in order to further study the extent of 
the second type of ambiguity. In this test a given document was speci- 
fied as pertinent and a cluster was found. The document which had the 
highest correlation to the cluster found was then specified as non- 
pertinent and another search was conducted. If a second cluster was 
found then the document with the highest correlation to the new cluster 
was added to Z and the process was continued. At some point the request 
became inconsistent. 
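
The second test is an iterative procedure and can be sketched as a loop.
The code is a sketch under stated assumptions: find_cluster and
correlation stand in for the clustering procedure and the correlation
measure, and the toy clusters exist only to make the example runnable.

```python
def ambiguity_test(document, find_cluster, correlation, max_rounds=50):
    """Sizes of successive answer clusters until the request is inconsistent.

    `find_cluster(Y, Z)` returns a set of documents, or None when the
    request is inconsistent; `correlation(doc, cluster)` returns a number.
    Both are hypothetical stand-ins for the clustering procedure.
    """
    pertinent, non_pertinent = {document}, set()
    sizes = []
    for _ in range(max_rounds):
        cluster = find_cluster(pertinent, non_pertinent)
        if cluster is None:
            break
        sizes.append(len(cluster))
        candidates = sorted(cluster - pertinent)
        if not candidates:
            break
        # Assumption: the pertinent document itself is never moved to Z.
        strongest = max(candidates, key=lambda doc: correlation(doc, cluster))
        non_pertinent.add(strongest)
    return sizes


# Toy stand-ins: two overlapping clusters and a made-up correlation.
TOY_CLUSTERS = [{1, 2, 3}, {1, 4}]


def toy_find_cluster(pertinent, non_pertinent):
    for members in TOY_CLUSTERS:
        if pertinent <= members and not (non_pertinent & members):
            return members
    return None


def toy_correlation(doc, cluster):
    return doc  # larger label = more strongly correlated (illustrative)


print(ambiguity_test(1, toy_find_cluster, toy_correlation))
```

The returned list corresponds to one row of a table of successive
cluster sizes, ending at the point where no cluster satisfies the
request any longer.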

The results of this type of test on six articles are given in
Fig. 9.7. Note that document $a_1$ of Fig. 9.5 would result in the test
pattern of Example 4, since $a_{23}$ is most highly correlated to A and the
answer to the request ($Y = a_1$; $Z = a_{23}$) is inconsistent.

Document      Correlation     Rank     Answer cluster
in Z          to A

$a_{15}$      110.4           22       D
$a_{16}$      102.8           27       $A \cap \overline{(a_{16})}$
$a_{17}$      122.0           14       B
$a_{18}$      106.6           24       $A \cap \overline{(a_5 a_{12} a_{16} a_{18})}$
$a_{19}$      116.2           18       E
$a_{20}$      112.3           21       $A \cap \overline{(a_5 a_{10} a_{12} a_{15} a_{16} a_{18} a_{20})}$
$a_{21}$      146.4            2       E
$a_{22}$      124.1           12       Inconsistent
$a_{23}$      155.6            1       Inconsistent
$a_{24}$      141.8            3       Inconsistent
$a_{25}$      115.4           19       E
$a_{26}$      130.4            7       Inconsistent
$a_{27}$      127.0           10       E

B = (a 1 a,a 1 Ha 1 ga- ) plus 12 other articles
D = (a.a-a. a^a.-a.-j) plus 11 other articles
E = ($a_1 a_2 a_{20}$) plus 20 other articles

Fig. 9.5. Example of clusters which result when documents
are specified as non-pertinent.

Fig. 9.6. Diagram of relationship of clusters of Fig. 9.5.
(Each circle represents a cluster.)



Example    Size of successive answer clusters

   1       31, 22, 27, inconsistent
   2       17, 125, 4, 2, inconsistent
   3       22, 36, 23, 23, inconsistent
   4       27, inconsistent
   5       33, 27, inconsistent
   6       39, 33, 14, inconsistent

Fig. 9.7. Test of request ambiguity.

9.2 Cluster Composition

In the last section statistics on some of the more general features
of clusters, such as size and bias, were presented. In this section the
composition of clusters will be described in terms of data available
in the T.I.P. file. In particular, examples will be given of the
composition of clusters in terms of the title words, authors, and
citations of the included articles.

In Fig. 9.8 we list, in order of frequency of occurrence, the title
words for six clusters. Note that the common "function" words (in, of,
the, and, on, etc.) have been omitted from all of the lists except for
Example A. Also, the lists have been truncated to include only the words
that occurred most often in the titles. The full titles of Example B
are shown in Fig. 9.16.
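
A frequency list of this kind can be produced with a few lines of code.
The sketch below makes two assumptions that the text does not confirm:
a word is counted once per title in which it appears, and the set of
function words shown is only a small illustrative sample.

```python
import re
from collections import Counter

# A small list of "function" words to omit (illustrative, not the thesis's list).
FUNCTION_WORDS = {"a", "and", "at", "by", "from", "in", "of", "on", "the"}


def title_word_counts(titles, omit=FUNCTION_WORDS):
    """Count, for each word, the number of titles in which it appears."""
    counts = Counter()
    for title in titles:
        # Hyphenated terms such as "spin-wave" are kept as single words.
        words = set(re.findall(r"[a-z]+(?:-[a-z]+)*", title.lower()))
        counts.update(words - omit)
    return counts


titles = [
    "Magneto-elastic waves in yttrium iron garnet",
    "Anisotropic spin-wave propagation in ferrites",
    "Dispersion of long-wavelength spin waves from pulse-echo experiments",
]
counts = title_word_counts(titles)
print(counts.most_common(3))
```

Counting per title rather than per occurrence makes each count directly
comparable with the number of articles in the cluster, which is how the
ratios in the following discussion are formed.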

In none of the cases studied did the title of every article in a
cluster contain the same word. For Fig. 9.8 the word that comes closest
to occurring in every title is "plasma" of Example D, which occurs in
18/22=82% of the titles. If one were to group together words of equivalent
meaning, then "superconducting" and "superconductors" in Example A
would be highest with 27/31=88%.

In Fig. 9.9 some similar examples are given for the authors of the
articles in clusters. In Example A it was found that E. Schlomann is
the author of two other papers in the T.I.P. file (in addition to the
four listed), R. I. Joseph of one other, and W. Strauss of two others.

In Fig. 9.10 citation counts are given for the same three clusters
that were used in Fig. 9.9. In Example A there is one citation which
is found in all of the articles in the cluster. In Example B, 46/64=72%
of the articles cite the same paper, while only 10/35=28% do in Example

Example A: Cluster A of Sec. 9.33 (31 articles, 99 words)

    22  in
    22  superconducting
    19  of
    13  ultrasonic
    10  energy
    10  gap
     9  the
     8  attenuation
     5  and
     5  superconductors
     5  tin
     4  by
     4  determination
     4  waves
     3  (11 words)
     2  (16 words)
     1  (58 words)

Example B: Cluster $A_1$ of Sec. 9.31 (12 articles, 66 words)

     7  waves
     5  spin
     3  garnet
     3  iron
     3  magnetic
     3  magneto-elastic
     3  microwave
     3  nonuniform
     3  propagation
     3  yttrium
     2  crystal

Example C: Cluster A of Sec. 9.33 (22 articles, ?5 words)

    12  quantum
    11  oscillations
     8  ultrasonic
     6  attenuation
     6  field
     6  giant
     6  metals
     5  effect
     4  magnetic
     4  magnetoacoustic
     3  absorption
     3  sound
     2  alphen

Example D: Cluster A of Sec. 9.52 (22 articles, 84 words)

    18  plasma
     9  turbulent
     8  waves
     5  particles
     4  electromagnetic
     4  turbulence
     3  charged

Example E: Cluster A of Sec. 9.51 (40 articles, 154 words)

    20  plasma
    17  probe
    11  langmuir
     9  probes
     5  characteristics
     5  field
     5  magnetic
     4  electrostatic
     4  resonance
     4  studies
     3  double

Example F: Cluster for article 8 of Fig. 9.11 (22 articles, 81 words)

    16  optical
     7  generation
     7  harmonic
     6  nonlinear
     5  theory
     j  second

Fig. 9.8. Title-word frequency counts for six clusters.
(The number to the left of each word is the number
of times it occurs in the titles of the cluster.)






Example A: Cluster $A_1$ of Sec. 9.31 (12 articles, 13 authors)

     4  Schlomann Ernst
     3  Joseph R. I.
     2  Damon R. W.
     2  Strauss W.
     2  Van De Vaart H.
     1  (8 authors)

Example B: Cluster $A_4$ of Sec. 9.32 (64 articles, 75 authors)

     7  Spector Harold H.
     4  Prohofsky E. W.
     3  Gurevich V. L.
     3  Kroger Harry
     3  Pustovoit V. I.
     2  (8 authors)
     1  (62 authors)

Example C: Cluster A of Sec. 9.52 (35 articles, 38 authors)

     7  Kraichnan Robert H.
     2  Deissler Robert G.
     2  Eschenroeder Allan Q.
     1  (35 authors)

Fig. 9.9. Author frequency counts for three clusters.



Example A: Cluster $A_1$ of Sec. 9.31 (12 articles, 35 citations)

    12  11-34-1298
     7  41-8-357
     6  11-35-159
     4  11-35-167
     3  1-105-390
     3  1-120-2004
     3  11-35-1022
     2  1-125-1950
     2  11-31-1647
     2  11-35-2382
     2  11-36-875
     2  41-6-620
     2  41-12-583
     2  708-19-308
     1  (21 citations)

Example B: Cluster $A_4$ of Sec. 9.32 (64 articles, 369 citations)

    46  41-7-237
    31  11-33-2457
    29  41-9-87
    22  11-33-40
    19  11-34-1548
    19  41-9-296
    18  1-127-1084
    14  1-126-1974
    14  41-8-4
    10  41-4-505
     9  1-134-1302
     9  28-8-161
     7  (4 citations)
     6  (7 citations)
     5  (12 citations)
     4  (12 citations)
     3  (18 citations)
     2  (49 citations)
     1  (262 citations)

Example C: Cluster A of Sec. 9.52 (35 articles, 195 citations)

    10  802-5-497
     6  227-2-124
     5  8-30-301
     5  799-7-1030
     5  802-12-242
     5  802-13-369
     5  802-16-33
     4  (3 citations)
     3  (13 citations)
     2  (33 citations)
     1  (139 citations)

Fig. 9.10. Citation frequency counts for three clusters.

C. Example C is an illustration of an area where the articles
do not all cite one central paper and yet, through the use of a large
positive bias, they can be pulled together into a cluster.

The papers listed in Fig. 9.10 are identified by three numbers:
the journal code (see Fig. 6.3), volume, and page number. Thus
1-136-441 is the paper beginning on page 441 in volume 136 of the
Physical Review.
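
The three-part identifier can be decoded mechanically, as the sketch
below shows. The function names are hypothetical. Journal code 1 is
given in the text; codes 11, 41, and 646 are inferred here by matching
the printed titles of Fig. 9.16 against the identifiers in the figures,
and the full code table of Fig. 6.3 is not reproduced.

```python
# Journal codes: 1 is stated in the text; 11, 41 and 646 are inferred
# (an assumption) from Fig. 9.16. Fig. 6.3 holds the full table.
JOURNALS = {
    1: "Physical Review",
    11: "Journal of Applied Physics",
    41: "Physical Review Letters",
    646: "Applied Physics Letters",
}


def parse_paper_id(paper_id):
    """Split 'journal-volume-page' into three integers."""
    journal, volume, page = (int(part) for part in paper_id.split("-"))
    return journal, volume, page


def describe(paper_id):
    """Return a readable reference for a paper identifier."""
    journal, volume, page = parse_paper_id(paper_id)
    name = JOURNALS.get(journal, f"journal code {journal}")
    return f"{name}, volume {volume}, page {page}"


print(describe("1-136-441"))
```

Identifiers whose journal code is not in the small table fall back to
showing the numeric code, so no journal name is guessed.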

9.3 Comparison to Bibliographies 

The next test will be to compare the bibliographies found in certain
papers with clusters formed by the procedure. Consider, for example, a
paper with 20 citations. It would be of interest to know if a cluster
can be formed which includes most, if not all, of the 20 citations.

For this purpose three articles were selected from the special
October 1965 issue of the IEEE Proceedings on ultrasonics. It was
decided that these articles, which are not part of the T.I.P. file, would
ensure some degree of independence between the data base and the evaluation
criteria. The IEEE Proceedings represented a journal which is closely
related to the T.I.P. physics file and yet is not actually part of the
file. Since the T.I.P. file covers only the last three years, a recent
issue of the IEEE Proceedings was needed if a suitable fraction of the
bibliographies of the evaluating papers were to be found in the T.I.P.
file.

Of the twenty-seven articles in the October IEEE Proceedings, only
ten cite ten or more articles in the T.I.P. file. Fig. 9.11 tabulates
these ten papers. For the three articles to be used in evaluating the
clustering procedure we selected the two papers with the highest percent
of their bibliographies in the T.I.P. file (1 and 2) and the paper with
the most references to the T.I.P. file (7).



Articles in Proc.        Total        Citations to    Percent of Bibliography
IEEE Vol. 53             Citations    T.I.P. file     in T.I.P. file

 1. pp. 1495-1507           22            10               46%
 2. pp. 1452-1464           38            16               42
 3. pp. 1517-1533           58            22               38
 4. pp. 1438-1451           86            32               37
 5. pp. 1508-1517           47            17               36
 6. pp. 1320-1336           33            11               33
 7. pp. 1586-1603          128            36               28
 8. pp. 1604-1623           67            18               27
 9. pp. 1387-1399           56            13               23
10. pp. 1547-1573          101            15               15

Fig. 9.11. Articles in the October 1965 issue of the IEEE
Proceedings that have 10 or more references to
the T.I.P. file.
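
The selection of the three evaluating articles can be reproduced from
the counts in Fig. 9.11. The sketch below is illustrative: the variable
names are hypothetical, and the counts are as read from the figure.

```python
# (article number, total citations, citations to the T.I.P. file),
# as read from Fig. 9.11.
ARTICLES = [
    (1, 22, 10), (2, 38, 16), (3, 58, 22), (4, 86, 32), (5, 47, 17),
    (6, 33, 11), (7, 128, 36), (8, 67, 18), (9, 56, 13), (10, 101, 15),
]


def percent_in_file(total, in_file):
    """Percent of a bibliography that is found in the T.I.P. file."""
    return 100.0 * in_file / total


by_percent = sorted(
    ARTICLES, key=lambda row: percent_in_file(row[1], row[2]), reverse=True
)
highest_percent = [row[0] for row in by_percent[:2]]
most_references = max(ARTICLES, key=lambda row: row[2])[0]
print(highest_percent, most_references)
```

The two criteria pick different articles because the paper with the
most references to the file also has a very long bibliography, so its
percentage is comparatively low.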



9.31 Bibliography 1 (IEEE Proc., v. 53, p. 1495)

From Fig. 9.11 we note that the article beginning on page 1495
has 22 citations, 10 of which are to articles in the T.I.P. file.
Fig. 9.12 lists the 10 articles as set B and also lists some other
sets of papers that will be found useful in the discussion that
follows. The i-th document in set B will be referred to as $b_i$, etc.

The answer clusters obtained by the procedure for 18 different
requests are tabulated in Fig. 9.13. The symbol $A[Y(b_i)Z(b_j)]$ stands
for the answer cluster with $b_i$ specified as interesting and $b_j$
specified as not interesting (i.e. $Y=(b_i)$, $Z=(b_j)$).

B



H 



1-136-UU2 11-36-31*53 1-129-991 

11-35-159 6U6-5-176 1-130-U39 
U-35-167 ^^-l^ 

11-35-1022 1-13U-U07 

11-36-108 F 1-136-1657 

rrii F-ri ^ 

hi-11-69 
U1-11-69 

G ,,_ iil-lii-25h 

11-35-836 310-7-1392 



11-36-12U5 11-35-993 1U6-2-38' 

ll-36-3li02 11-36-661 669-16-hlO 

11-36-18U5 669-18-2 35 

790-8-59** 



Fig. 9.12. The sets of articles included in the 
clusters for Bibliography 1. 



Answers to Selected Requests; 

A[Y(h.)]=A 1 for 1-2... 5,7,8,10 A^b^A^)^ 

AtYCb^]^ A[Y(b 9 ),Z(h lU )]=A x 

A[Y(b 6 )]=A 2 AtY^)]^ 

A[Y(b )]-A, A[Y(b b )]=A UF plus 5 members of H 

ALU V 3 12 1 and 50 other articles 

AtY^.-.b^)]*^ 

A[Y(b 1 ...b 1Q )]=A 2 yA 3 

Definitions of Clusters: 

Al =(b 2 ...b 5 ,b 7 ,b 8 ,b 10 )UDUE A 3 =(b 9 )UEUH 

A^UO^U* A^Cb^UG 

Fig. 9.13. List of the answer clusters formed for Bibliography 1. 






In Fig. 9.14 the probable answers for requests consisting of other
combinations of b's are suggested. Not all of the requests listed in this
figure have actually been tested, but experience with the clustering
procedure and the results of Fig. 9.13 make it appear reasonably safe
to assume that the conclusions are correct.

$A[Y(b_i b_j)] = A_1$   for i, j = 2...5, 7...10 ($i \neq j$)
$A[Y(b_6 b_i)] = A_2$   for i = 2...10
$A[Y(b_1 b_i)]$ = (large set of 70-100 articles)   for i = 2...10
$A[Y(b_9) Z(h_i)] = A_1$   for i = 1...18

$A[Y(\text{any combination of } b_2 \ldots b_5, b_7 \ldots b_{10})] = A_1$
$A[Y(b_6 \text{ plus any combination of } b_2 \ldots b_{10})] = A_2$
$A[Y(b_1 \text{ plus any combination of other b's})]$ = (large set of 70-100 articles)

Fig. 9.14. Generalizations suggested by the results of Fig. 9.13.

A diagram showing the amount of overlap of the various answer
clusters is shown in Fig. 9.15.

Fig. 9.15. Sketch showing the relationship of the
answer clusters of Bibliography 1.

Some comments will now be made concerning the results given in
Figs. 9.12 - 9.15. When the request consists of a single member of
the bibliography, the same answer results in 7 out of 10 cases. This
cluster, $A_1$, contains 8 of the 10 articles in the bibliography ($b_1$ and
$b_6$ are omitted).
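
The comparison of a cluster with a bibliography reduces to two set
operations, sketched below. The function name coverage and the document
labels are hypothetical; the counts are chosen to mirror the case just
described (8 of 10 bibliography articles in a 12-document cluster).

```python
def coverage(bibliography, cluster):
    """Return (fraction of the bibliography in the cluster, extra documents)."""
    found = bibliography & cluster
    extras = cluster - bibliography
    return len(found) / len(bibliography), extras


# Hypothetical labels: b1..b10 for the bibliography, x1..x4 for other articles.
bibliography = {f"b{i}" for i in range(1, 11)}
cluster = (bibliography - {"b1", "b6"}) | {"x1", "x2", "x3", "x4"}

fraction, extras = coverage(bibliography, cluster)
print(fraction, sorted(extras))
```

The first value measures how much of the bibliography the cluster
recovers; the second lists the documents the cluster adds beyond it.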

The article $b_9$ is included in $A_1$ but does not result in $A_1$ when
used as a request. It results in an almost completely different set of
documents ($A_3$) which contains only one member of the bibliography. The
request $Y(b_9)$ is, therefore, ambiguous, with either $A_1$ or $A_3$ being a
valid answer. To resolve the ambiguity various documents from the set
H were placed in the non-pertinent set Z. This shifted the answer from
$A_3$ to $A_1$. It was found that the ambiguity could also be resolved by
placing an additional document in the Y set. Thus a request such as
$Y(b_9 b_j)$, with a second bibliography member $b_j$, also resulted in the answer $A_1$.

The cluster $A_2$ exemplifies another type of ambiguity. The set $A_1$
is a subset of the set $A_2$ and thus the requests $Y(b_i)$, where i=2...5,7,
8,10, could be satisfied by either $A_1$ or $A_2$. The request $Y(b_6)$ can
only be satisfied by $A_2$, however, since $b_6$ is not included in $A_1$. Thus
the article $b_6$ is slightly "beyond" the cluster $A_1$ and, if used in the Y
set of the request, results in the more general cluster $A_2$ of 17 documents
instead of the cluster $A_1$ of 12 documents. Note that both requests of
the form $Y(b_6 b_i)$ with i=2...10 and the larger request $Y(b_2 \ldots b_{10})$
result in the cluster $A_2$.

The only article from Bibliography 1 which is not included in $A_2$
is $b_1$. The request $Y(b_1)$ results in the cluster $A_4$, which is disjoint
from any of the clusters discussed so far. When requests of the form
$Y(b_1 b_i)$, i=2...10, are used, very large clusters result, including most
of the documents listed in Fig. 9.12 and many more. A check of the
paper from which Bibliography 1 was taken reveals that $b_1$ is cited
only as a source for the values of some constants. It is suggested
that this may be the reason it does not fit into the closely related
cluster $A_2$, which includes the other nine papers.

One final observation will be made. There are four articles in
$A_1$ and nine in $A_2$ that are not part of the original bibliography.
The question of whether these papers constitute valid additions to the
bibliography will be discussed in Chapter X. Let us at this point,
however, present the titles of the papers in $A_1$ (Fig. 9.16) as an
illustration of the type of additional articles included in the
clusters.

9.32 Bibliography 2 (IEEE Proc., v. 53, p. 1452)

In Figs. 9.17 - 9.20 we present the same data for Bibliography 2
that were given for Bibliography 1. Here again a large majority of
the documents (11 of 16) in the bibliography lead to the same cluster
($A_1$) when specified as interesting in the request.

From Fig. 9.20 we observe that clusters $A_1, \ldots, A_4$ form a hierarchical
series of increasingly larger sets, with each new set including the
previous set. The set $A_4$ contains 14 of 16 members of the bibliography
and 50 other documents. The set A is the only set in the series that
has bias. The series can, of course, be extended to sets which are
larger than $A_4$ or to subsets of $A_1$ by additional changes in the bias.

There are two members of the bibliography ($b_6$ and $b_{13}$) that do not
fit into the pattern set by the other 14 members. The article $b_6$ has
no positive connection to any other paper (i.e. none of the papers it

Print the titles of the articles related to J Appl Phys v. 35 p. 159.
12 documents in set 1.

Journal of Applied Physics, Volume 35, page 159

Generation of spin waves in nonuniform magnetic fields I.
Conversion of electromagnetic power into spin-wave power and
vice versa.

Page 167

Generation of spin waves in nonuniform magnetic fields II.
Calculation of coupling strength

Page 1022

Magneto-elastic waves in yttrium iron garnet

Volume 36, page 118

Magneto-elastic waves in yttrium iron garnet

*Page 1245

Electronically variable delay of microwave pulses in
single-crystal YIG rods

Page 1267

Microwave magneto-elastic resonances in a nonuniform magnetic
field

Page 1579

Demagnetizing field in nonellipsoidal bodies

*Page 3402

Anisotropic spin-wave propagation in ferrites

*Page 3453

Propagation of magnetostatic spin waves at microwave
frequencies in a normally-magnetized disc

Physical Review Letters, Volume 12, page 583

Dispersion of long-wavelength spin waves from pulse-echo
experiments

Applied Physics Letters, Volume 5, page 33

Propagation, dispersion, and attenuation of backward-traveling
magneto-elastic waves in YIG

*Page 176

Wall effects in single-crystal spheres of yttrium iron garnet
(YIG)

End. 9.6 sec. used.

Fig. 9.16. Titles of articles in the $A_1$ cluster.
(The four * articles were not part of the
original bibliography.)

B
1-134-1302
1-135-1761
1-136-772
1-136-1731
1-138-1721
11-35-125
11-36-528
41-11-246
41-12-47
41-12-555
41-13-434
41-14-372
646-4-82
646-4-190
646-4-212
146-6-81

D
1-129-1009
1-130-910
1-131-1087
1-131-2512
1-132-522
1-132-679
1-134-507
1-135-1388
1-137-311
1-138-1250
1-139-1949
3-81-130
11-35-137
11-35-1483
11-36-3728
21-31-1700
29-30-149
29-31-957
41-13-308
43-37-545
49-4-45
49-4-194
49-13-285
49-17-14
80-19-674
80-20-1131
80-30-1424
80-20-1647
80-20-1946
80-20-2160
310-5-1818
310-7-688
384-32-100
612-3-448
612-3-698
669-16-383
669-16-1612
669-19-242
669-19-1407
669-12-1113
821-2-149

E
41-14-706
310-6-2233

F
669-17-1432

G
1-136-869
41-12-241
49-19-268
310-6-2473
646-7-45
646-7-82

H
1-130-919
1-131-95
1-131-1469
1-133-183
1-133-1493
1-134-728
1-134-1313
1-134-1429
1-135-51
1-135-1662
1-137-801
1-137-1305
1-138-534
1-138-1559
1-139-539
1-140-2110
1-142-126
3-82-401
3-86-709
11-36-22
11-36-3281
12-39-1493
21-30-1717
21-30-1817
41-11-14
41-11-146
80-20-363
669-21-1034
821-2-141

Fig. 9.17. The sets of articles included in the clusters
for Bibliography 2.



Answers to Selected Requests! 
ALY(b )J«A 1=1,2,3,5,7,8,9, 
x 11, 12,11*, 16 

AtY^)}^ 
A[Y(b l5 )]-A u 
A[Y(h 6 )]-(t 6 ) 
A[Y(h 13 )]-A 5 

A[y ( b 2 b lt )] " A 3 

Definitions of Clusters : 
B l-( b l b 2 b 3 b 5 b 7V9 b ll b 12 b ll* b l6> 



B„ 



■ B iU b 



10 



V B 3 U b i5 



A[Y(b i5 b i6 )] - A l* 

A[Y(b it b 13 )]=A u yb 13 L|(29 others) 
A[Y(b x . . .b^b ? . . .b^b^. . -b^^^l* 

A[Y(b l]4 )Z(d 22 )]=A 5 

A[Y(b lU )z(b 3 )=A 5 n(b 9 b i;L h l8 h 19 h 2;? b 3 ) 

A[y(b lU )z(b 3 b 13 )]-(b 8 i> 9 b 11 b lli )U 

(d 2 d 6 d 20 d 22 d 2i* d 25 d Ul ) 

A^UD 

A 2 =B 2 UDUE 

A 3 =B 3 yDUEUF 

a^udIIeUfUo 

V ( Vl3VU H 



Fig. 9.18. List of the answer clusters formed for 
Bibliography 2. 

AtYtb^)]^
A[Y(bzb,)]= Inconsistent 



for b i ,b jC B 1 

for b^C^ 

for b^Bg 

for b J CB 3 

(b^ is not linked to any other paper.) 



AtY(b 13 b i )]-A ii Ub 13 (29 others) for b^B^ 
A[I(X 1 )]=A 1 

A[Y(b 10 X l )3=A 2 
AfrO^Xg)]^ 

AtlCb^Xj)]-^ 



for XjCBj^ 
for Xf:\ 
for XgCBg 
for X 3 B, 



Fig. 9.19. Generalizations suggested by the results of Fig. 9.18.

Fig. 9.20. Relationship of answer clusters of Bibliography 2.

cites are cited by other papers) and is thus isolated from the rest of
the file. Article $b_{13}$ can be included in a cluster with the rest of
the papers if the bias is made large enough. The cluster $A[Y(b_4 b_{13})]$
contains, for example, all of the bibliography except $b_6$.

There is one significant characteristic that the five papers not 
included in A., have. They all have relatively few citations. Articles 
bx and b^ have only two citations each. Articles b.,- and b.,^ have 
only three. Article b. has seven. In contrast the bibliography 
articles in A., all have seven or more citations except b_ and b,i 
which have five each. It is suggested that perhaps the reason bz and 
b-, are not included in the cluster A 1 is that they have insufficient 
references to position them properly in the network. 

9.33 Bibliography 3 (IEEE Proc., v. 53, p. 1586)

In Figs. 9.21 to 9.24 the data for Bibliography 3 are presented.
The paper from which this bibliography is taken has four sections
(I, II, III, IV), with Section III having four subsections (III A, B, C, D).
The particular section (and subsection) in which each bibliographic
item is first cited is noted in Fig. 9.21. These section numbers are
also noted over the symbols for the documents in Fig. 9.23. Some of
the documents in Fig. 9.23 are enclosed in parentheses. This is to
indicate that the document has already appeared elsewhere in the
diagram.

From Fig. 9.23 we note that a hierarchical series of clusters ($A_1$ to
$A_4$) similar to the one in Fig. 9.20 is formed by 13 of the documents
of Sec. III. A similar but separate series (A,- to A ft ) is formed by the
documents of Sec. IV. There also appears to be a separation of the

B



E K M (Con't.) 



1-129-12 IIIA 1-129-1990 

1-129-13 IIIC 1-131-2512 

1-129-652 IIIA 1-133-1589 

1-131-111 IIIA l-13ii-507 

1-131-653 IIIA 1-136-1170 

1-131-1^97 rv 1-137-1717 

1-131-2U20 HID 1-138-88 

1-132-1062 IV 1-138 -lii53 

1-132-1073 IV 1-139-18149 

1-132-2039 IV Ul-12-357 

1-133-1U87 IV 310-7-383 

l-135-7iiO IIIA 669-17-628 

1-135-1161 IV 

1-136-1096 HID 

1-137-211 IHC 



1-131-73 


80-18-1569 


1-132-621 


669-16-1U81 


1-1314-1 


669-17-87 


1-135-19 


669-18-51 


1-136-306 


669-18-896 


1-136-203 


669-20-267 


1-136-893 


669-20-560 


1-136-11+71 


669-20-583 


1-138-1661 


669-21-75 


1-139-7U6 




l-lliO-1902 




l-ll4l-l452 


N 


1-1U3-229 

Ul-15-862 

669-16-9U5 

669-I8-83I4 

669-21-7014 


1-131-2U33 

1-131-2U63 

1-132-1991 

1-136-998 

1-137 -1431 




141-12-558 


R 


80-20-1136 


669-18-1260 


P 



1-137-889 IHC 669-18-1125 

1-137-lliOO IIIC 669-19-^9 

1-138-U87 IHC b&9-i9-i^9 

21-29-357 IV 

I4I-H-316 HID 

I4I-I2-IOI4 IIIC - 

hi-12-166 iiic 1-138-1191 

141-12-360 HIE 669-I6-15I4 

l4l-13-l62 IHC 669-I8 -I4I9 M 1-133-llOU 

li9-7-112 HID - 1-139-1876 

149-8-155 IIIA 1-129-1088 1-1143-1452 

U9-8-160 IV H 1-130-92 h9-13-282 

U9-12-297 IHC „ „, 1-130-565 

li9-13-287 IHC L -\\l ~°Ji 1-131-617 

U9-1U-13 HIA iTi !?"«? 1-131-1995 ft 

149-1U-73 IHC iu.-u.-w 1-131-2078 2q 2Q cc 

U9-17-18U HIC 1-132-1512 t~ * ,00? 

6U6-6-111 IV 1-133-UU Tliio-187 

669-17-50 IIIA ± 1-133-15U6 I fj° {I' 

669-18-I403 iiic U9-5-233 1-135-1698 |"rr" Tiy 

669-20-552 IIIA 149-7-133 1-137-1172 i,q 7 7 

80-20-1U2I* 1-137-1706 Ln" O OQ7 

D 1-139-823 80-20-13714 

^^" I'lit^l 3°iO-6-2 3 5 7 6 U 5 

1-132-522 t lk02065 669-16-818 

1-132-535 i"Sll?2 669-16-1U59 

1-135-181 i.JJi.^3 669-18-908 



Fig. 9.21. The sets of articles included in the clusters
for Bibliography 3.

Answers to Selected Requests:



ACYd^J-A-L 


1-1,2, 20,23,36 


AtY^)]^ 
A[Y(^)]-A 3 
AtYCb^)]^ 




AtY^)]-^ 


i-l5... 17, 22,2k, 
28,29,32 


A[Y(b i )]=A 6 


i=8...11,13,27 


AtY^)]^ 
AtY^)]"/^ 


i«l8,19 


AtY^)]^ 


i«lt,3li 


AtY^)]-^ 
A[Y(b 30 )]*A n 




Definitions of Clusters: 



A 1 =(b 1 b 2 b u b 23 b 3ii b 36 b l6 b l8 b 20 ) U 

dUe 
a 2 -a 1 U(ViU>Uf 

A^AgU^JUG 

V A 3U<VUH 

A 5" (b l5 b l6 b 17 b l8 b 20 b 21 b 22 b 2U 
b 28 b 29 b 32 )UD U J U(e 1 h 1 ) 

W 9 b 10 b ll b 13 b 27 ) U K U 
(h^e^eg) 



AtY^ )]-A 3 U( b i5 b i7 b 2 l h 3 J i ) 

A[Y(b.)]» Misc. large sets of 

documents (88-159 articles) 
i-3,12,25,26,31,33 

A[Y < b l8 b 21 )] " A 5 

a[ Y(b 2 b 22 b 2ii b 3$ ) MAjU a 5 U (b ? b 35 f 2 ) ] 

O(b-) 
A [Y(Wb 2 )] -(cluster of 108) 

AtY ( b l6 b l8 b 29 b 3$ )] - A 12 



VMKsV 

V ( V5 b li l b 3U b 36>UM 
A 10 -A 9 (J(b 7 )UHU(e 7 ) 
A^^b^^UPU 

(d 6 e l e 6 e 8 b l b 2 m l5 B 17 q 6 ) 
A 12" A 3U A 5UK 2 1 7 ) 



Fig. 9.22. List of answer clusters formed for Bibliography 3.
[Figure: diagram relating the answer clusters for Bibliography 3, with documents labeled by the subsection (IIIA, IIIC, IIID, IIIE, or IV) that cites them.]

Fig. 9.23. Relationship of answer clusters of Bibliography 3.
documents by subsection within Sec. III. Note that 10 of the 13 documents
cited in subsection IIIC are included in cluster A^.

The structure of the clusters in this example was found to be 
considerably more complex than in the previous two examples and no 
attempt is made to predict the results of requests that have not been 
explicitly tested. One can gain some appreciation of the complexity of 
the interrelationships between the clusters by an examination of 
clusters A„ to A.. . 

As with Bibliographies 1 and 2 there are a few of the documents
that are not included in the clusters of Fig. 9.23. Nine articles are
cited by Sec. IV. All of these except b,, are included in the cluster
Ag. Thirteen articles are cited by Sec. IIIC. All of them but bo'^ll'
and b«? are in A^ and all but b— are in A_. The cluster Aj~ is more
general in that it includes not only articles cited by Sec. IIIC but
also those cited by Sec.'s IIIA, D and E. Of the 27 articles cited by
Sec. III, 20 are included in A... The seven missing articles are b,, b^,
b 12, b 25, b 26, b 30, and b 31.

The article b, was examined in detail in an attempt to discover
why it was not included in A... It was found to have six references.
Of the six, one was keypunched incorrectly. Two of them are to articles
in a Russian journal (Soviet Physics - JETP), whereas the other references
to these articles in the T.I.P. file are to the journal in which
the English translation is found. A fourth reference is to a paper
written by the same author and not cited by anyone else, and a fifth is
to a bulletin, which was evidently not sufficient to cause it to be
included in A. 2. It was found that if the references had been correctly
keypunched and had been to the correct English translations, b, would
have been included in A., and probably A^.

There is one other feature of the article from which Bibliography 3
was taken. In the final paragraph the author made this comment:

"I wish to thank ...A. R. Mackintosh for calling B. I.
Miller's work to my attention."

The article by B. I. Miller was checked to see if it would have
been included in any of the clusters if it had been part of the T.I.P.
file. It was found to have only one reference but this reference was
sufficient to cause it to be included in A.^. Thus this procedure
could have performed the same reference service that A. R. Mackintosh
did.

9.4 Comparison to Categories

In the last section we compared clusters to the bibliographies
compiled by the authors of three articles. Another source of sets of
articles that have been judged to be related would be the subject index
found in one of the journals or in Physics Abstracts. For this purpose
one category was selected from the subject index of Physical Review and
one category was selected from Physics Abstracts.

9.41 Physical Review Category

Most of the categories in the Physical Review Subject Index are 
very broad. The sets formed by clusters, on the other hand, are in 
general much smaller and much more specific. Of course, larger clusters 
could be formed by including a large number of articles in the Y set of 
the request, but they would require a large amount of effort to process 
and compare. For this reason a category with relatively few entries was 
selected. Its title changed periodically over the three-year period,
but it was identified as the one which was referred to when one looked
up the word "luminescence" in the word list which was supplied with
the subject index. The various titles used for the category are as
follows:

1963 Luminescence (18 articles)
1964 46.4 Luminescence and Fluorescence (6 articles)
1965 42.3 Optical Emission and Absorption (17 articles)
1966 44.3 Optical Emission and Absorption (2 articles)

The same format used for presenting the data in Sec. 9.3 is used
here in Fig. 9.24-26.

It will be seen from Fig. 9.26 that most of the papers separate
into the three major areas represented by A^, A 9, and A 26. A statistical
analysis of the composition of each of these three clusters is given
in Fig. 9.27. It is found that the only words that appear more than
once in the titles of two or more of the clusters are optical, absorption,
radiation, and crystals. The correspondence of these words to the
title of the original category (optical absorption and emission) is of
interest.
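
The title-word comparison just described can be sketched in a few lines. This is only an illustration under stated assumptions, not the program used to produce Fig. 9.27: the function name `shared_frequent_words`, the whitespace tokenizer, and the toy titles are all introduced here for the example.

```python
from collections import Counter

def shared_frequent_words(cluster_titles, min_count=2, min_clusters=2):
    """Words that occur at least `min_count` times in the titles of
    at least `min_clusters` clusters (hypothetical helper)."""
    frequent_in = Counter()
    for titles in cluster_titles.values():
        # Count every title word of this cluster, ignoring case.
        counts = Counter(w.lower() for t in titles for w in t.split())
        for word, n in counts.items():
            if n >= min_count:
                frequent_in[word] += 1
    return sorted(w for w, k in frequent_in.items() if k >= min_clusters)

# Toy data (invented titles, not taken from the T.I.P. file).
toy = {
    "X": ["optical absorption in crystals", "optical emission lines"],
    "Y": ["optical spectra of ruby", "optical pumping", "raman scattering"],
    "Z": ["raman laser", "stimulated raman emission"],
}
print(shared_frequent_words(toy))
```

In the toy data only "optical" appears more than once in two clusters; "raman" is frequent in one cluster only and is therefore not reported.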

A similar analysis of the author lists showed that N. Bloembergen
was the only author that appeared more than once in two or more of the
lists. The citation lists were also found to have very little overlap.
The greatest overlap occurred between A and A ^. For example, the 1st,
3rd, 5th, 7th entries in the list for A were found in the list for A^
with a count of 2.

It is thus concluded that the articles in the clusters A^, A 9,
and A 26 do have different characteristics. Whether the distinction
B D G R

1-129-169 1-13U-1166 1-139-588 1-133-163 

1-129-593 1-137-801 1-11*0-576 1-133-1717 

l-129-2l»22 1-138-1 1-13U-299 

1-130-502 1-138-960 1-1314-1*23 

1-130-639 3-82-393 1-135-1676 

1-130-9U5 3-85-565 1-137-583 

1-130-2257 3-86-709 H 1-137-1016 

1-131-127 1*1-12 -5ol* , 1?q iq8o 1-138-276 

1-131-501 1*1-13-331* llii'.lTo 1-139-1687 

1-131-508 1*1-13-657 1-132-21*50 1-139-1965 

l-131-llll* 1*1-13-720 1-11*0-880 

1-131-11*56 1*9-10-52 80-19-2260 

1-131-151*3 1*9-11-291* 669-21-201* 

1-131-2036 61*6-6-25 j 

1-132-221* 

1-132-1023 1-131-1912 

1-132 -11*82 1-132-1029 

1-132-2501 1-135-950 M 

1-133-1163 E 1-135-1622 ao-lQ-92k 

1-136-11*1 . ... 1Q 1-137-1087 80 19 92U 

1-136-271 i"S"l05l 1-138-1287 

1-136-508 j"^)?"pfl7 1-139-3H* 

1-136-51*1 i"Sl1o6 H-3i*-l682 

1-136-1091 n iJ"S n-35-1183 

1-137-508 igQ^;f 7? 3 12-38-151*1* N 

1-137-536 ^S"iS1o2 12-38-1607 ?— 

1-137-1117 199-139-202 12-38-2289 1-11*0-957 

1-137-1651 12-39-3118 1*9-5-186 

1-137-1787 12-1*2-1999 6l2-l*-261* 

1-138-63 1*9-18-219 

1-138-180 1*9-19-98 

1-138-806 _ 80-18-11*1*8 

1-138-171*1 80-19-1096 

1-139-321 1-129-125 

1-139-51*1* 1-132-2023 

1-139-1239 1-137-1515 

1-139-1616 1-138-11*72 

l-ll*0-l55 l-138-li*77 

1-11*0-263 1-139-1262 K 

1-11*0-601 1-139-1991 1-133-1029 

1-11*0-1867 ?-^°i? 2 1-136-U81 

1-11*3-372 l*l-ll*-61* 

1-11*3-571* 1*9-19-89 



12-1*2-31*01* 



1-139-970 



Fig. 9.24. The sets of articles included in the clusters
for Category 1. 
Answers to Requests:
A[Y(^) i )]-A 1 i-29,lt2 
AttO^)]-^ i-26,U3 
A[Y(D 3ll )]«=A 3 
ACY^)] 



A[Y(b, 



28 



)]-A, 



A[Y(b 30 )]-A 6 
AClO^)]^ i-8,19 
A[X(b )]-*g 

A[T(b li l )3 " A 9 
A[Y(b 39 )]-A 10 

AtY^)]^ 

A[Y(b l7 )] 



12 

A[Kb 4l )]^ aJl 

AtYft^Jl-A^ 
A[Yd, Uo )]-A l6 
AtY^)]-^ 
AtY^Jl-^g 



i«$,12,27 



i-7,22,2U 



AtY^) 
A[Y(b i ) 



A[Y(b, 
A, i-33,37,38 A[Y(b 



25 
35 



A[Y(b ± ) 
A[Y(b l5 



,A 19 
,A 20 
-^21 
-A 22 
"^3 

-(b ± ) 



i-10,11 
i-13,18,20 



i-li,6 



AtY^)]-^) i-3,9,U 
AtY^JJ-Clarge clusters) i«23,32,36 
A[ Y(b 1 b 2 b 12 ) ]-(l07 articles ) 
A[Y(b 28 b 3ii )]-A 3 UA r A 25 
A[Y(b 2 gb 3() b 3it )]-(lOli articles) 
A[Y(b 35 b U2 )] -(large) 
A[Y(bgb 17 )]« (large) 
A[Y(b 2 b 39 )]-(large) 
AtYfbg^JJ-darge) 

A[Y(b l8 b 2U b 27 )3 " A l5U A 17 U^sU^O 



Definitions of Clusters : 
V#29 b 3A2 )UD 

A 3- A 2U(V38 ) 

V (b 2 9 b 33 b 37VU E 

V A i,U<V 
A 6-< b 30 d l> 

A^-^gb^JUPUG 
Ag^UCb^) 

A 10' (b 39 g 2 ) 
A^b^gj,) 

*12 -Cb l7 f l g l ) 
A 13"^ b 5 b 12 b 27 * U (r.r„r 



A ilf 


a^UO^Uk 


A l5' A H*U< 1> 3l r 5 ) 


A 16- 


(b l b 7 b 27 b UO ) U( r i-' r 8 r 9 r ll ) 


A 17 - 


< b i b 7VWUR 


A 18" 


(b 7 b 22 b 2U n l r 2 r 8 ) 


V 


(b io b n B i ) 


**>" 


(V'isWisVU* 


^1- 


(b 25 k 2 ) 


hz" 


^V^i* 


A23- 


<V»6> 


V 


(b l5 r 7 r ll ) 


^5" 


^U^ 



^VlG^^J A 26" A 15U A i 7 ^ A 1BU A 20 fyW6> 



Fig. 9.25. Answers to selected requests for Category 1.

Fig. 9.26. Relationship of answer clusters for Category 1.
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CLUSTER /L^ 



CLUSTER A,. 



(30 


articles) 


(18 


articles) 


109 words: 


81 


words : 


13 


raman 


7 


SiC 


9 


stimulated 


6 


Exciton 


6 


laser 


5 


Complexes 


6 


radiation 


1* 


Absorption 


6 


scattering 


I* 


Luminescence 


5 


theory 


3 


CdS 


1* 


fluctuations 


3 


Effects 


k 


intensity 


3 


Emission 


3 


effects 


3 


Hitrogen 


3 


emission 


3 


Optical 


3 


liquids 


3 


Radiation 


3 


media 


3 


Recombination 


3 


optical 


2 


Cadmium 


3 


order 


• 
• 




3 


waves 


• 




2 

• 
• 


anti 






• 

37 


authors : 


2 
6 


5 authors: 


^~ 


Shen Y. R. 


Choycke W. J. 


U 


Bloembergen H. 


6 


Hamilton D. R. 


2 


Armstrong J. A. 


2 


Patrick Lyle 


2 


London R. 


2 


Dean P. J. 


2 


Smith Archibald W. 


2 


Reynolds D. C. 


2 


Tang C. L. 


1 


Anders W. A. 


1 

• 


Anderson H. G. 


• 
• 
• 




292 


citations t 


2l*8 


citations : 


12 


1-127 -1918 


13 


Ul-U-361 


10 


1-130-2529 


11 


1-128-2135 


10 


1-131-2766 


11 


1*1-1-1*50 


10 


1-133-37 


10 


1-127-1868 


10 


ia-9-itfS 


8 


1-131-127 


10 


la-11-160 


7 


1-116-1*73 


10 


U9-7-186 


6 


1-133-1163 


9 


61x6-3-181 


5 


1-120-1661* 


8 


1*1-11-1*19 


5 


1-127-1878 


8 


U1-12-50U 


5 


1-132-2023 


7 


1-13U-1U29 


1* 


(5 citations) 


7 


61*6-3-137 


3 


(7 citations) 


6 


1*1-12-290 


2 


(1*2 citations) 
(181* citations 


5 


(5 citations) 


1 


ii 


(11 citations) 


• 
• 




3 


(17 citations) 
(3U citations) 


• 




2 






1 


(212 citations) 







CLUSTER A 26
(55 articles)

214 words:
 12 ruby
 11 optical
  9 lines
  8 KCl
  8 spectra
  7 crystals
  6 absorption
  6 thermoluminescence
  5 excited
  5 F
  5 HgO
  4 center
  4 Cr+
  4 irradiated
  4 R
  4 relaxation
  3 alkali
  ...

85 authors:
  5 Sturge M. D.
  5 McCumber D. E.
  3 Bloembergen N.
  3 Schawlow A. L.
  3 Yen W. M.
  2 Arten J. O.

846 citations:
 22 80-13-880
 15 1-122-381
 15 12-36-2757
 14 11-34-1682
 13 1-122-1469
 10 1-130-639
 10 12-20-1752
  9 80-13-899
  8 1-57-426
  8 30-31-956
  7 (3 citations)
  6 (12 citations)
  5 (8 citations)
  4 (18 citations)
  3 (33 citations)
  2 (121 citations)
  1 (741 citations)

Fig. 9.27. Comparison of the three clusters formed for Category 1.
between the clusters is of practical significance to a user would, of
course, require further experimental justification.

As an additional comparison the results of this section were compared
with the articles found in the category in Physics Abstracts with
the title "luminescence." This category contained 22 of the articles
listed in Fig. 9.24 (14 in set B and 8 others). All of these 22
articles were included in A 9 or A 26. This would tend to indicate that
the Physics Abstracts indexers considered the articles of A ^ to be in
a different area than A 9 and A 26 also.



9.42 Physics Abstracts Category

Since a property (luminescence) was chosen for the last section, 
it was decided that a category covering a substance might be appropriate 
for this test. We again sought a category with relatively few entries 
so that it would be easier to compare it with the related clusters. 
The category with the heading, "Erbium", was selected. The articles 
classified in this category from January 1963 to the present are listed 
in set B of Fig. 9.28. Fig.'s 9.29 and 9.30 present the related
clusters. 

9.5 User Experience 

In the last two sections we compared the results of the clustering
procedure to three bibliographies and two categories. In this
section we will present the response of the system to some actual
requests for information. The responses to both a relatively simple
request and a more complex request are studied.
29-31-1
1*9-6-19 



1-136-717 12-38-2171 

1-136-726 12-39-3251 



1-11*0-1968 12-l»2-l62 



B E H M (Con't.) 

1-131-101(3 1-131-158 1-139-21(1 12-39-1021* 

1-131-1586 1-I3l(-1620 3-82-871* 12-39-1151* 

1-132-1609 1-137-1139 12-38-2750 12-1(0-71(3 

1-137-138 1-138-21(1 12-1(2-1(000 12-1(1-892 

1-137-1109 3-85-955 12-1*3-1680 12-1*2-71*3 

11-35-101(7 11-36-1209 8O-I8-I636 l6i(-39-3l(2 

11-36-1001 U9-17 -96 310-7-11(50 
11-36-1127 j 

11-36-121(9 F i lla 9V c S 

12-38-2190 1-132-5U2 1-132 So 1-138-151*1* 

12- 39 -i285 1-133-219 l-lll-lll 12-38-11(76 

12-39-1629 l-13l(-9l( l-136-lu33 12-38-2190 

12-39-2128 11-35-800 l-lkO-Toll 12-39-213U 

12-1,0-2751 12-1*3-2087 I-lS-ll5 12-U-1305 

12-1*0-3606 t? M S? 12-U1-3227 

*•$;•%% G ^ : ^-6i7 12 ^ 3 - 1702 

!? ]p"ft7? 1-129-1601 1(1-11-196 P 

12 t1~8L7 1-130-1100 1-133-1361* 

29-20^77 1-133-1571 K 1*9-19-1*63 

LQ-8-5 1-13U-320 1-11(1-1* 

LQ 11 100 l-13l*-ll(92 1(3-36-505 „ . ^ ^ 

tq'n"^?? 1-136-175 1-137-1886 12-1(1-1970 

KI15I301 1-136-231 1-139-2008 R 

1*9-16-265 1-136-271 3-81*-297 11-36-21*22 

U9.-17-95 ^HMH S"5"S£, 80-20-997 

80-20-808 

80-20-1332 



199-137-790 1-137-627 12-1(0-796 1-133-1361* 

310 6 ???? 1-137-11*1(9 12-1*0-31(28 

310-6-2225 l.-iJ.n.ia/W io.I.o.iAo T 



l-lld-352 12-1(2-993 21-29-971* 

n 19Q ?n7? 1-11(1-1*61 12-1(2-3797 1(9-20-1*96 

I-I30I1337 3-81-663 12-1*3-2121* 

x 130 133 f 12-39-11(22 1*1-11-253 U 



l"Sl32 5 ^^5 ^^ gi^IUs 

£So 9 *'S~\*? "VHt- 66 9-l8-1022 

1-138-216 S'^'S! 1-130-91(5 v 

1-139-1606 124*2-981 1-130-1370 -r 

i-S-1896 12-Itf-U23 1-133-31* 1-U5-97 

3-S-8U6 21-29-9W i-i^-l^l* w 

? si. a-j 21-31-81*5 1-13U-172 

fifc-S, 21-31-1325 1-13V1501( 1-^*0-1188 

11-36-9^6 f9-10-l6 l-137-l?l*9 1 ' 1 ^^ 1 

ll-36-io?8 ^"^n 1-138-1682 X 

11-36-3628 310-7-1150 1-11*1-259 u-M-gBh 

12-39-11(1(9 12-1(1-892 



Fig. 9.28. The sets of articles included in the clusters 
for Category 2. 
Answers to Requests:



A[y(b.)]=A 1 


1=1,6,11,20 


A[Y(b 9 )]=A lU 


A[Y(b 2? )]=A 2 




A[Y(b 12 )]=A l5 


A[Y(b ? )]=A 3 




A[Y(b 28 )]«A l6 


A[Y(b l7 )]-A^ 




A[Y(b 29 )]=A 1? 


A[Y(b lU )]=A 5 




A[Y(b 26 )]=A l8 


A[Y(b 3Q )]=A 6 




A[Y(b 2 ^)]=(b 25 ) 


AtYfb^)]^ 




A[Y(b 1Q )]=A 19 


A[Y(b l5 )]=Ag 




A[Y(b 13 )]=A 2Q 


A[Y(b l8 )]=A 9 




A[Y(b 5 )]=A 21 


A[Y(b l6 )]-A 1Q 




A[Y(b 2 )]=A 22 


A[Y(b 23 )]=A i;L 




A[Y(b.)]=A 23 


A[Y(b.)]=A 12 


i=22,2U 


A[Y(b u )]=A 2U 


A[Y(b 9 )]=A 13 


Definit. 


Lons of Clusters: 



i=3,21 



A 2 =A 1 \J(b 2? )UE 

A 3 =A 2 y(b ? )UF 

V (b 3Vi7 )UGU(d U e U ) 

A r A i^ ( V UH 
A 6 =A 5 u(b 2 b 20 b 30 d 5 d 7 f 3 )UJ 

A 7= A 6U^ l3 b l9Vl k 2 ) 

Ag^yCb^...^) 

A 9 =A oU( b i8 )UM 

A io =A 9 U<VU N 

A 11 =A 10U (* 2 i*23 f 5 )UP 



A ir (b 9 )UR 
A l5 =(b 12 n 2 )US 

A l6 =(b 28 g 26 m l5 )UT 



A 1? =(b 29 ) 



A 12 = (b 22 b 2lVU ) 



A l8 =(b 26 )UV 

A 19 =(b 10 b lU b 17 g 10 g 19 S 23 g 26 h 2 J 2 j iA 
k U k 7 k 10 k 13 m ll n l n 6 n 7 ) 

A 20 =(b 13 b 17 b 19 g 3 g a g li* g 17 g l8 s 19 g 21 g 22 g 26 
h 2 h 3 h Ii J 7 k 3 k [t k 5 k 6 k 11 k lli m 12 n u ) 

^i^V^Wii^U" 

A 22 =(b 2 b 17 b 20 d 5 d 7 V3 g 2- ' ' g 6 g 12- * - g l5 

g 17 g l8 g 21 g 23 g 25 g 27 h 2 h 3 h U ,3 l' ' • j 6 J ll ) 



A 23 =(b 3 b lU b l8 b 21 b 30 f 5 g 5 g l5 g l8 g 27 g 29 
A,, = (b fl )UQ . h lJ8Jqkoni2^1f2) 

13 8 A 2r (A 23l jby n(snr) 

Fig. 9.29. Answers to selected requests for Category 2.
[Figure: diagram relating the answer clusters for Category 2.]

Fig. 9.30. Relationship of answer clusters for Category 2.
9.51 Simple Request

This test was performed in cooperation with a research physicist 
from Lincoln Laboratory. His initial request consisted of the following 
relatively brief specification: 

words:    turbulence
          subsonic   }
          hypersonic }  perhaps
          wake       }

authors:  Lees
          Hromas

articles: none

No articles were found which were written by the two authors
(actually there were three papers by a Lees but in a completely
different area). There were 70 articles that had either "turbulence"
or "turbulent" in their titles (set T of Fig. 9.31). There were 27
which contained one or more of the words "wake", "subsonic", or
"hypersonic" (set W of Fig. 9.31).

At this point a number of the articles in Set T were used as 
requests to the clustering procedure. The cluster structure shown in 
Fig. 9.32 and 9.33 resulted. The physicist was asked to evaluate the
pertinence of each of the articles presented. He gave three types of 
responses: pertinent (y), non-pertinent (n), and questionable perti- 
nence (m). The responses are indicated in Fig. 9.31 and also in Fig. 
9.32 by the superscripts. It will be noted that nine of the twelve 
articles specified as pertinent are in the A, cluster. 

The physicist was asked if there was any detectable difference
between the articles in the A, and A- clusters, which were found to be
disjoint by the procedure. Of the 16 articles in A,, 15 were from Russian
journals, while 27 of the 35 articles in A, were from American journals. It was
T T (Con't.) W D

11-36-2075 y 799-6-1016 m l-13u-58l H-36-3609 y 

11-36-2201 n 799-6-101*8 m 1-135-1761 17-32-298 n 

21-31-11*1 n 799-6-1250 n 1-138-931* 669-I8-698 11 

29-30-17 y 799-6-1260 n 3-82-669 669-I8 -lOlii n 

l*l-ll*-8l3 n 799-6-1693 m 11-36-31* 669-19-U99 n 

Ul-llt-892 n 799.7.190 n 1*1-10-127 669-19-1165 n 

1*1-15-381 n 799-7-335 m 1*1-13-1*37 669-20-135 n 

l t 9_9_li t i 1 n 799.7-562 m 1*1-12-592 790-10-605 n 

1*9-12-201 y 799-7-629 m 1*1-13-71*2 799-6-1603 n 

1*9-13-297 m 799-7-816 m 1*1-15-31*6 

1*9-18-221* n 799-7-1030 m 1*9-19-1*59 

80-19-lJt30 n 799-7 -10U8 m 80-18-288 

381t-32-292 n 799-7-1156 y 80-l8-l5l5 

61*6-7-285 y 799-7 -ll60 m 61*6-i*-28 

669-16-295 n 799-7-1163 m 61*6-7-187 

669-16-1578 n 799-7-1169 m 799-6-9U6 

669-17-1*03 m 799-7-1178 m 799-6-1388 

669-17 -1UU9 n 799-7-1191 n 799-7-197 

669-18-81*7 n 799-7 -lU03 n 799-7-667 

669-18-1251 n 799-7-1723 y 799-7-111*7 

669-18-1268 m 799-7-1735 y 799-7 -1198 

669-19-3U9 m 799-7-1920 n 799-8-1*1* 

669-20-1^5 n 799-8-391 n 799-8-211 

669-20-1519 n 799-8-1*92 n 799-8-956 

669-21-710* y 799-8-575 m 799-8 -H*28 

669-21-771* m 799-8-598 y 799-8-11*56 

669-21-1161 n 799-8-1063 m 799-8-1792 

790-6-882 n 799-8-1509 n 

790-6-1017 m 799-8-161*7 n 

790-7-31*1* n 799-8-1659 n 

790-8-51* n 799-8-1775 m 

790-9-1057 n 799-8-1792 y 

790-9-11*29 n 799-8-2219 y 

790-10-191 n 799-8-2225 n 

790-10-101*1 11 821-2-332 n 



Fig. 9.31. Sets of articles included in the
clusters for Physicist 1.
(y=pertinent, n=non-pertinent,
m=questionable pertinence)






t 62 t 6U t 6$ t 68 d 9 ) 

-A. i-U6,U7,li9>50,55, 
1 60,62, 6U,65,68 

AtKd^]-^ 

A[Y(t 36 )]-A 1 U(t 36 ) 

A[Y(t w )]«A 1 U(t 36 t li8 ) 

A ^(t 6l )3-A 1 U(* 3 6\8 t 52 t 6l ) 
A[y(t 5l )]-A 1 U(t 36 t w t 52 t 6l t 5 ^ 

A[Y ( t i) ] -(V23 t 2l, t 25 t 26 t 27 

-A 6 U i-19,2li,25,26,27 
AtYtd^l^y i-3,it,5 

A[Y(t 32 )]«A 6 y(t 32 ) 

A[Y(t 1? )].A 6 U(t 32 t 22 t 17 ) 
AtYttgJl^ytt^t^t^tg) 

A[Y(t l6 )]-A 6 y(t 32 t 22 t 17 t l6 ) 

AMt^Mt^t^t^) 
i-37,66 



A[Y(t 31 )]-(t 31 t 3U t 65 ) 
A[Y(t 33 )]-(t 33 t 3Q t 65 ) 
A[Y(t 1 )]-(t 38 t U3 t 5 g) 1-38,1*3,58 

ACY^a-ty^) 

A[Y(t 28 )]-(t l6 t 28 ) 
A[Y(t 5l4 )]-(t 5U v 17 ) 

AWdgJMdjt^) 

AtYfxJl-fd^) x^,^ 

AtKt^Mtgt^) i-2,69 

AtKt^l-ft^) i-3,12 

A[Y(t i )]-(t 5 t 20 ) i-5,20 

AtY(t i )J-(t 9 t 23 ) i-9,23 

AtY^JMt^t^) i-21,22 

A[Y(t 1 )]-(t 3lt t 39 ) i-3U,39 

A[Y(t i )]-(t 53 t 5? ) i-53,56 

AMt^Wt^t^) i-lU,56 

A[Y(t,)]-(t.) 1-1,6,7,10,11,1$, 
1 i 29,30,35,lt0,Ul, 

ii2,l*li,li$,59,63 



Fig. 9.32. Answers to selected requests for Physicist 1.
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m m y m m n n y 
1*7*U9 50 55*60*62 *6U*65*68 



t m t m t y 
'^8*51*52*61 



n m n m y n m 
13*33*37*38*1^3*56*58*66 



n ,n n ,n 

31*3U*39 2 



<«£> 



n .m m n 
16 *17 22 32 



n t n t y t m t n 
23*211*25*26^27 



,n ,n ,n 
d 3 % d 5 



& 18 *70 




w, W. 



i w il* w i5 



Fig. 9.33. Relationship of answer clusters for Physicist 1. 

(y-pertinent, n=non-pertinent, m=questionatole pertinence) 






initially thought that the separation of the two clusters
was probably due to the fact that the Russians generally cited Russians
while the Americans cited Americans. After examining the two sets, the
physicist expressed the opinion, however, that A_ appeared to be more
concerned with the upper atmosphere and ionosphere.

Also supporting the contention that there is a valid and useful 
distinction between A- and A ? is the fact that nine of the eleven 
articles judged to be pertinent were from the A, cluster. 

Because of the incompletely inverted files and the delays caused 
thereby, the actual searches were performed by the author of this 
thesis and later discussed with the physicist. It was interesting to 
note that at one point in the discussion, he stated that he could have 
more correctly shaped the final cluster by being able to specify as non- 
pertinent some articles on turbulence in helium that appeared in one of 
the clusters. 

We note in passing that the physicist who aided in this test is 
the author of article t,--. 

9.52 Expand Extensive Bibliography

In this section an example is given of how the clustering procedure 
might be used to supplement or extend an already sizable collection of 
papers on a given subject. 

A bibliography of 112 articles on Langmuir probes was supplied to
the author by another research physicist at Lincoln Laboratory. Of the
112 articles, 89 are in journals, 54 are in the 25 journals covered by
the T.I.P. file, and 21 are actually in the T.I.P. file. The identifications
of the 21 articles in the T.I.P. file are given in Fig. 9.34.
Fig. 9.35 shows the distribution of the articles in the file with time.
Fig. 9.36 lists the words occurring in five or more of the 112 titles.
In this list words such as "of", "the", "theory", etc., have been omitted.
Also words have been grouped by stem. Thus, the words "ion", "ions",
"ionized", etc., are all grouped under the word "ion".

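The tabulation of title words (stop words omitted, variants grouped by stem, a threshold on the number of titles) might be sketched as below. This is a minimal sketch under stated assumptions: the stop list, the hand-made stem table `STEMS`, and the function name `title_word_distribution` are hypothetical, and the thesis does not say how stems were actually assigned. Punctuation handling is deliberately left out.

```python
from collections import Counter

# Illustrative stop list and stem table (assumptions, not from the thesis).
STOP_WORDS = {"of", "the", "theory", "a", "in", "and"}
STEMS = {"ions": "ion", "ionized": "ion", "probes": "probe"}

def title_word_distribution(titles, min_articles=5):
    """Number of titles in which each (stemmed) word occurs,
    keeping only words found in at least `min_articles` titles."""
    counts = Counter()
    for title in titles:
        # A set, so that a word is counted once per title.
        words = {STEMS.get(w, w) for w in title.lower().split()}
        counts.update(words - STOP_WORDS)
    return {w: n for w, n in counts.items() if n >= min_articles}

titles = ["Theory of the ion probe", "Ionized gas probes", "Ions in a plasma"]
print(title_word_distribution(titles, min_articles=2))
```

With the three toy titles and a threshold of two, only "ion" (three titles) and "probe" (two titles) survive.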


Set B

3-82-243      11-36-1866    49-11-126     799-6-1492
11-34-1165    11-36-2363    80-18-260     799-it-1^33
11-34-3209    21-30-182     80-18-1908    799-7-1843
11-35-1130    21-30-193     690-8-720     799-8-56
11-36-337     21-30-375     799-6-1479    799-8-73
11-36-675

Fig. 9.34. The 21 articles of the Langmuir probe bibliography that are
in the T.I.P. file.



[Figure: number of articles plotted against year of publication.]

Fig. 9.35. Publication year distribution of initial
Langmuir Probe bibliography.
Words                   Number of articles

probe                   87
plasma                  40
Langmuir                35
ion                     18
gas                     15
discharge               13
electron                12
collection              10
density                  8
low                      7
pressure                 6
spherical                6
electrostatic            6

probe and plasma        32
probe and Langmuir      35
probe and ion           16
probe and gas            7
probe and discharge      6

Fig. 9.36. Title word distribution for the 112 titles of
the initial Langmuir probe bibliography.



As an additional part of this test it was decided that five other 
types of search strategies would also be used and their results would 
be compared to the results of clustering. The five search strategies 
selected will now be described. 
TITLE WORD SEARCH 

One possible search strategy would be to retrieve all those
articles which have some word or logical combination of words in their
titles. The choice of the word or words to be used was made on the
basis of the frequency of occurrence of the words in the bibliography
(Fig. 9.36) and in the T.I.P. file and with the advice of the physicist.
Several test runs were made with various word combinations. A simple
request for all articles with the word "probe" in their titles was
selected. This retrieved 58 articles including 20 members of the
original bibliography.
AUTHOR SEARCH

There are 114 different authors of the 112 articles in the bibliography.
A search of the T.I.P. file for articles by these 114 authors
yielded 120 articles (21 from the original bibliography and 99 other
papers). This search was not exhaustive but involved looking for
authors only in those journals where it was thought they might publish.
CITATION SEARCH 

The third type of search consisted of finding all of the articles
that cite one or more of the 112 articles in the bibliography. A
search of the T.I.P. file using this criterion yielded 78 articles.
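
A citation search of this kind reduces to a set intersection per article. The sketch below is illustrative only: the function name `citation_search`, the dictionary layout (article identifier mapped to its list of cited identifiers), and the toy identifiers are assumptions, and nothing here reflects the actual T.I.P. file structure.

```python
def citation_search(file_citations, bibliography):
    """Return {article: number of bibliography items it cites} for every
    article in the file that cites at least one item (hypothetical helper)."""
    targets = set(bibliography)
    hits = {}
    for article, refs in file_citations.items():
        n = len(set(refs) & targets)  # cited items that are in the bibliography
        if n > 0:
            hits[article] = n
    return hits

# Toy file: article -> articles it cites (invented identifiers).
toy_file = {
    "x1": ["b1", "b2", "z9"],
    "x2": ["z1"],
    "x3": ["b2"],
}
print(citation_search(toy_file, ["b1", "b2", "b3"]))
```

The count kept for each hit corresponds to what column E of the later evaluation table records for each paper.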
BIBLIOGRAPHIC COUPLING SEARCH 

When two papers cite one or more of the same papers they are said 
to be bibliographically coupled (Sec. 6.22). There are 270 articles 
that are bibliographically coupled to one or more of the 21 articles 
in set B of Fig. 9.34.

The coupling strength between two papers is defined to be the 
number of identical citations that they have. The coupling strength 
between one paper and a set of papers is defined to be the number of 
citations in the single paper which are also found in one or more of 
the papers in the set. In Fig. 9.37 we show the distribution of the
270 articles by their coupling strength to the set B. 
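
The two definitions of coupling strength translate directly into set operations. The following is a minimal sketch, assuming each paper is represented by the list of identifiers it cites; the names `coupling_strength`, `coupling_to_set`, and `strength_distribution` are introduced here for illustration and are not from the thesis.

```python
from collections import Counter

def coupling_strength(refs_a, refs_b):
    """Number of identical citations shared by two papers."""
    return len(set(refs_a) & set(refs_b))

def coupling_to_set(refs, set_refs):
    """Number of citations in one paper that also appear in one or
    more papers of the set (set_refs is a list of reference lists)."""
    pooled = set().union(*set_refs)  # every citation made by the set
    return len(set(refs) & pooled)

def strength_distribution(file_citations, set_refs):
    """Count of articles at each non-zero coupling strength to the set."""
    strengths = (coupling_to_set(r, set_refs) for r in file_citations.values())
    return Counter(s for s in strengths if s > 0)

set_b = [["c1", "c2"], ["c3", "c4"]]              # toy set B
toy_file = {"p": ["c1", "c2", "c3"], "q": ["c2", "c9"], "r": ["c7"]}
print(strength_distribution(toy_file, set_b))
```

Note that the strength to a set counts each citation once, however many members of the set share it, which is why the citations of the set are pooled before intersecting.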
JOINTLY CITED SEARCH 

Bibliographic coupling occurs between two papers if they cite 
one or more of the same papers. Another type of coupling occurs if 
two papers are cited by one or more of the same papers. There are 
605 papers which occur in one or more bibliographies with articles of
set B. Of the 605, 101 are in the T.I.P. file.
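
The jointly cited search can be sketched in the same style: collect every paper that shares a reference list with a member of set B, then keep those present in the file. The function name `jointly_cited`, the data layout, and the choice to exclude members of set B from the result are assumptions made for this illustration.

```python
def jointly_cited(file_citations, set_b):
    """Papers that appear in at least one reference list together with
    a member of set B (members of set B themselves are excluded here)."""
    set_b = set(set_b)
    found = set()
    for refs in file_citations.values():
        refs = set(refs)
        if refs & set_b:            # this bibliography cites a member of B
            found |= refs - set_b   # so everything else in it is jointly cited
    return found

# Toy file: citing article -> its reference list (invented identifiers).
toy_file = {
    "x1": ["b1", "p1", "p2"],
    "x2": ["p3", "p4"],
    "x3": ["b2", "p2", "x2"],
}
co_cited = jointly_cited(toy_file, ["b1", "b2"])
in_file = co_cited & set(toy_file)   # jointly cited papers that are in the file
print(sorted(co_cited), sorted(in_file))
```

The final intersection mirrors the step from the 605 jointly cited papers to the 101 that are in the file.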
[Figure: number of articles (logarithmic scale) plotted against coupling strength.]

Fig. 9.37. Distribution of articles with various bibliographic
coupling strengths.



CLUSTERING 

The user specified the article b „ as the article of greatest 
interest in the bibliography. The articles b & , b Qi b l6J and b.^ were 
ranked next in terms of interest. The clusters which resulted when 
these and various other articles were used as requests to the system 
are shown in Fig.'s 9.38 - 9.40.



11-311-1097 

55-1+1-132 

80-19-1915 

612-2-719 

799-7-1329 

799-8 -7U8 

E 
3-83-971 
11-36-3135 
11-36-31U2 
11-37-180 



E (Con't.) 

iil-11-310 
lji-l5-286 
6U6-U-186 

F 

3-81-682 

11-36-3U2 

11-36-2361 

11-36-3526 

612-3-18 

790-7-788 



G 
3-83-U73 
11-35-130 
55-U1-391 
55-H-1U05 
790-7-921 

H 
799-7-110 
799-3-920 
799-8-2097 



11-35-136^ 
790-10-1102 
799-6-1762 
799-7 -18 3U 

K 
80-18-U26 
80-18-1056 

80-20-81*5 

612-2-58 

M 

11-37-377 



Fig. 9.38. The sets of articles included in the clusters
for Langmuir Probe Bibliography (Physicist 2). 
Answers to Requests:



A[Y( ti ) 
A[Y(D i ) 
A[Y(b.) 
A[Y(b 3 ) 
AtYC^) 
A[Y(b 19 
A[Y(b 5 ) 
A[Y(b 2 ) 
A[Y(b 1Q 



=k ± 1=1^,16,17 
»A 2 i-1,7 
«A 3 i-8,9,11 

= A u 

-A^ i=U,6,20,21 

^ A 6 
" A 7 

]=(cluster of 82 articles) 



A[Y(b 12 )]-A 9 
A[Y(b l5 )]-A 10 
AfrO^)]*^) 1=13,18 
A[Y(b 6 b 8 b l6 b 17 b 19 )]^ All 

A t Y (V3V6 b 7 b 8 b 9 b H b l^l6 b 17 

b 19 b 20*21 )] " A 12 
A[Y(d i )]=A 1 i-1,.,.,6 

A[Y(e i )]-A 2 i-l,3,...,6 

A[Y(e 2 )]-Ag 



Definitions of Clusters : 

V (b 8Vl6 b 17 )UD 
A 2 -(b 1 b 7 b 8 b aJi )Ul 

A 3 K(b 3 b 8 b 9 b ll b l9 )U(d A d 5 ) U F 
A ^(b 3 b 8 b 9 )U(f 1 f 2 f li )Uo 

V ( V6Vl6 b 20 b 21 )U(d 2«l 8 U ) 

V< b l6 b l7 b l9 b 20 b 21 )UH 
^"(b^JUK 



A 8 B( Vl9Vl e 2V2 )lJJ 
V (b 12 b lU e i e 2 m l ) 

A io- (b i5 f 5 J 2 ) 

A 12" A 1 U A 2 U A 3 U\ U A 5 U< Vl J 2 ) 
A 13 =A 12 U(V 3 V 



Fig. 9.39. Answers to selected requests for Langmuir Probe 
Bibliography (Physicist 2). 







Fig. 9.40. Relationship of Clusters for Langmuir Probe
Bibliography (Physicist 2). 



COMPARISON 

The six preceding search strategies produced a total of about 500
different articles. It was decided that this constituted too large a
file to ask the user to evaluate. The file was, therefore, reduced to
the 104 articles which appeared to have the greatest chance of being of
interest to the user. These included the 83 articles which were retrieved
by two or more of the six search strategies, the 15 additional articles
which were bibliographically coupled to the set B with a value of three
or more and another six articles which contained the word "probe" in
their titles in the sense of a measuring device. In seven other
articles the word "probe" was found in the title but it was used as
a synonym for investigation (e.g. "three-field model as a probe of
higher group symmetries").
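
The reduction rule just described (83 + 15 + 6 = 104 articles) can be written as a single filter. The sketch below is hypothetical: the function name `select_for_evaluation`, the data layout, and the strategy labels are invented for the example, and the decision that a title uses "probe" in the sense of a measuring device is treated as a hand-made judgment supplied as input.

```python
def select_for_evaluation(hits, coupling, probe_device):
    """Pick the articles to show the user.

    hits         -- article -> set of strategies that retrieved it
    coupling     -- article -> coupling strength to set B
    probe_device -- articles whose title uses "probe" as a measuring
                    device (judged by hand, an assumption of this sketch)
    """
    selected = set()
    for article, strategies in hits.items():
        if (len(strategies) >= 2                   # two or more strategies
                or coupling.get(article, 0) >= 3   # strongly coupled to set B
                or article in probe_device):       # relevant title word
            selected.add(article)
    return selected

hits = {
    "a1": {"title", "author"},
    "a2": {"coupling"},
    "a3": {"coupling"},
    "a4": {"title"},
    "a5": {"title"},
}
coupling = {"a2": 4, "a3": 1}
print(sorted(select_for_evaluation(hits, coupling, probe_device={"a4"})))
```

Here "a5" stands for a title in which "probe" means investigation, so it is retrieved by the title word search but not kept.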

The 104 articles presented for evaluation are listed in Fig. 9.41.
The first column (A) is the identification. The next column (B) contains
an indication (1) of those articles which are members of set B.
The next six columns (C-H) note which articles were retrieved by each
of the six search strategies:

C - Column contains a one if the paper has the word "probe" in
its title.

D - Number of authors of the paper that are also authors of the 112
papers in the Bibliography.

E - Number of the 112 papers in the Bibliography that are cited by
the paper.

F - Bibliographic coupling strength of the paper to the set B.

G - Number of papers which cite the paper and also cite one or
more of the 112 papers in the Bibliography.

H - Symbol of the paper in the clusters of Fig. 9.38 to 9.40.

(Note that the counts in Columns D and F do not include the authors
or citations which match only because the article itself is in the
set B.)
The last column (J) contains the evaluation code. Each document was 
assigned to one of the following five categories: 

1 - Of personal interest to user. 

2 - Of general interest. 

3 - Perhaps of general interest. 

(e.g. a probe may have been used as a tool in the experiment.) 
A B C D E F G H J    A B C D E F G H J

1-129-llBl ZZZZfJZ J U1-15-101B --1-1-- 3 

1-132-11*35 - - 1 - 3 1 - 3 U9-U-135 - - 1 - 1 - - 5 

1-132-11*1*5 - - 1 - it 2 - 3 1*9-5-214 - - 1 - - 2 - 5 

i-U2-2363 --11--- 3 U9-H-126 112-31^2 

1-132-2551* 3 - - 3 U9-19-118 - l - 1 - - - 2 

1-I3l*-I2l5 6 - - 3 ^9-20-7 - 1 - - - - - 5 

1-137-31*6 U - - 3 U9-20-269 - 1 - - 1 - - 1 

1-138-101$ - - 1 - 1 - - 3 g-Ul-132 - - - - 3 - d 1 

1-1UO-7U8 3-- 3 grftift. ~ J -~"t"S3l 

1-11*0-778 1* - - 3 55-1*1-11*05 3 - g£ 3 

l-JJl-ii6 U - - 5 55-1*1-1980 - l u 2 

3-81-682 6 - f . 3 80-18-260 1 1 3 1 - - b^l 

3-82-2U3 11-3 2 3^ 1 S - 1 ?^ --l- 2 , 1 ^ 

lla3-li73 nig 5 80-18-558 3 - - 5 

3-83-971 .-■■ e 2 80-18-1056 - - 1 - 2 - k g 5 

3.8U-133 U - - 3 80-19-566 - - l - l - - 5 

11-31,-1665 11- -lib- 1 80-19-1908 1 1 2 2 3 - b. 1 

11-3U-1897 -1U5U 1 80-19-1915 - 1 1 2 2 - <^ 2 

ll-3k-26l3 - 1 1 2 80-19-2313 - - 1 1 1 - - 3 

S-3U-3209 11.131*3 1 80-20-8U5 --1-1-^5 

11-35-130 - 1 - l 8 1 d 1 16U-37-2U1 - 1 - - - - - 5 

11-35-1130 li-ili? 1 612-2-58 - - l - 1 - K 5 

11-35-1365 - - 1 - 1 - il 3 612-2-719 -1-16 1*2 

11-36-337 1 bj 3 612-3-18 - - - 1 8 1 f 5 5 

11-36-31*2 - 1 - - 6 - f5 5 gJilfto " i p 2 

11-36-1*35 - 1 1 - 1 - - 2 612-3-789 - 1 2 2 

U- 6-675 1 1 1 - 2 - b 6 l 6U6-U-186 - 11 1 : | 1 « ; ? 2 

n-36-1659 - 1 5 ggii^ii, " x x : 3 : : J 

11-36-1866 1 1 2 - 2 - b- 1 669-16-887 i - * 

11-36-2361 -12-2-f 1 790-6-91*7 ---111- 3 

11-36-2363 il--8-b 1 790-6-990 - 1 \ 

U-36-2672 - 1 - - 1 - - 8 1 790-7-580 - - - 1 2 1 - 5 

S-36-3135 - - 1 - 9 - e. 2 790-7-788 - 1 - - 3 - f 6 1 

11-36-31U2 .llUle 2 790-7-921 ---111^5 

11-36-3526 - 7 - f? 3 790-8-319 - 1 1 " 5 

S-36.37to - - 1 - 1 - > 3 790-8-720 1 1 - - - 1 b^l 

11-37-180 - 1 1 - 2 - e, 1 790-9-961 ---11-- 3 

t, „ ?1 t -11I1I1--2 790-10-1102 3-J 2 3 

ii:37-377 1-2 12--, 3 799-6-11,79 1 1 2 1* 1* 6 b£ 1 

11-37 il9 - - 1 - 2 - - 1 3 799-6-11,92 1 1 1 3 3 7 W 

17 27 -67fc - - 1 - 1 - - U 799-6-1762 2 2 j^ 2 

2 7 lI 9 -93 --i-1-- 3 799-7-110 - - - j* 2 1 £ 2 

21-29-1165 - 1 1 1 799-7-1329 2 7 - d $ 5 

21-29-1313 -11 1 p-7-1^33 ll----b^l 

21-30-182 1132113b 1 ?99-7-l5l7 " " \ 1 } " " I 

21-30-193 113112*,! 799-7-1831* - - 1 - 1 " K \ 

2^30-375 1 1 3 h 20 1 b£ 1 J99-7-18U3 11 - U - b* 1 

21-30-2021 - - 1 - 3 - - 3 799-8-56 IJo 

21-31-1632 - - 1 - 1 - - k T99-8-73 1 "V 

1*1-11-310 - - 1 - 1 2 e c 2 799-8-71*8 - 1 1 1 1 - d 6 1 

£-13-83 - 1 5 5 799-8-920 --1111^3 

£-15-286 2 - e 6 3 799-8-2097 - - 1 2 2 - h 3 3 

Fig. 9.1*1. Langmuir Probe papers evaluated by physicist. 

(Explanations of columns are given in text.) 



213 



h - Degree of interest cannot be determined by examination of the 
author ( s ) . 

5 - Not of interest. 

In Fig. 9.42 the results of each of the six search strategies are
tabulated for comparison. The results for bibliographic coupling are
separated into two entries depending on the coupling strength.

An examination of Fig. 9.42 indicates that the search strategies
using the author, citation, and cited-by-same criteria yield compara-
tively large sets of documents containing relatively few of the articles
judged to be of specific pertinence by the user (evaluation category 1).

Bibliographic coupling with the coupling strength greater than or
equal to one yields such a large set of articles (270) that it would be
more appropriate to compare it with a larger cluster such as the 85-
article cluster which contained 26 of the category-1 documents. Let us
therefore compare cluster A13 with the set of articles with coupling
strength greater than or equal to two. It will be seen that A13 is less
than half as large and yet contains three more of the category-1
documents.

It will be observed that the clustering procedure uses the same
data used in bibliographic coupling but in a different way. Consider,
for example, the 27 articles in A13 which are not part of the original
bibliography. Seven have a coupling strength to B of only 1 and six
have a coupling strength of 2, whereas an article like 1-129-1181
with a coupling strength of 7 is not included in A13.

                                          Number of articles in each
                             Number of    evaluation category
                             articles
Search Strategy              retrieved      1     2     3     4     5

Title word                       58        30    11     1     2     6
Author                          120        18    10    15     2     8
Citation                         78        16     7     8           5
Bibliographic coupling           88        19    10    19
  (strength >= 2)
Bibliographic coupling          270        26    12    29          15
  (strength >= 1)
Cited-by-same articles          101        13     8     4           7
Clustering (A13)                 43        22     8     7           6

Total                        abt. 500      31    16    32     4    21

Fig. 9.42. Comparison of results of seven search strategies.



Let us now turn our attention to the title word search. Fig. 9.42
indicates that this search strategy retrieved four more of the
category-1 documents than were retrieved by the search strategies based on
citations (i.e. bibliographic coupling and the 85-document cluster).
This result provides an example of a case where title words provide a 
better basis for retrieval than do citations. Previous experience 
would indicate that such is not generally the case. 

To determine why the clustering procedure was less effective in
this case, the five category-1 documents which did not appear in any of
the clusters generated were examined. It was found that three of them
(b., , b.£, and 21-29-1165) contain only a single citation and the other
two (b.g and 21-29-1313) contain only two citations. We are thus led
to the same conclusion arrived at earlier, that the clustering system,
in general, has trouble properly placing documents with three or fewer
citations.

The remedy for this difficulty would be to use some additional
types of partitioning data. In the example at hand, all 31 of the 
category-1 documents could be retrieved in the same cluster if the 
system used not only the partitions generated by citations but also 
those generated by certain keywords like "probe". 

One other observation may be worth noting. The article, b^, was 
part of the original bibliography but was not included in any clusters 
with other members of the bibliography. A check of its bibliography 
showed that it had nine citations, which experience indicated should be
enough to place it in the correct cluster. The author of this thesis
decided, therefore, to ask the physicist if b^ was in a different area
from the other 20 members of the bibliography. Before this was asked,
however, the evaluation of the 104 articles of Fig. 9.41 was made. A
check of this evaluation revealed that 19 of the 21 members of the
original bibliography were placed in evaluation category 1 while b^ 
was placed in category 3. 

9.6 Summary of Results 

For purposes of comparison and emphasis let us summarize some of
the significant features of the last three sections. In Fig. 9.43 two
measures of the success of the clustering procedure are tabulated.
Column four indicates how many of the pertinent articles were retrieved
by the clustering system in each test. Column five indicates what
fraction of the articles retrieved were pertinent. The particular clus-
ter selected for each test is specified in parentheses in column three.

                        Number of                    Percent of   Percent of
                        papers        Size of        pertinent    cluster
                        specified     Related        papers in    specified as
Name of Test            as pertinent  Cluster        cluster      pertinent

Bibliography 1             10         17 (A?)        9/10=90%     9/17=53%
(Sec. 9.31)

Bibliography 2             16         64 (A4)        14/16=88     14/64=22
(Sec. 9.32)

Bibliography 3 (III)       27         48 (A12)       20/27=74     20/48=42
(Sec. 9.33)

Bibliography 3 (IV)         9         31 (A9)        8/9=89       8/31=26
(Sec. 9.33)

Bibliography 3 (IIIC)      13         22 (A?)        10/13=77     10/22=46
(Sec. 9.33)

Category 1                 43         105            28/43=65     28/105=27
(Sec. 9.41)                           (A9 U A2? U A26)

Category 2                 30         133            19/30=64     19/133=14
(Sec. 9.42)

User 1                     12 (Y)     59 (A10)       9/12=75      9/59=15
(Sec. 9.51)

User 2                     31 (1)     43 (A13)       22/31=71     22/43=51
(Sec. 9.52)

Fig. 9.43. Summary of the experimental results of
Sections 9.3-5.
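
The two measures of Fig. 9.43 are simple ratios, and a minimal sketch of the arithmetic may make the columns easier to read. The function name and argument names are hypothetical; the three counts passed in below are taken from the Bibliography 1 and User 2 rows of the figure.

```python
def retrieval_percentages(n_pertinent, cluster_size, n_hits):
    """Return the two measures of Fig. 9.43 as percentages.

    n_pertinent  - number of papers specified as pertinent
    cluster_size - number of papers in the related cluster
    n_hits       - pertinent papers that appear in the cluster
    """
    # Column four: share of the pertinent papers found in the cluster.
    pertinent_retrieved = 100.0 * n_hits / n_pertinent
    # Column five: share of the cluster that was specified as pertinent.
    cluster_pertinent = 100.0 * n_hits / cluster_size
    return pertinent_retrieved, cluster_pertinent


# Bibliography 1: 10 pertinent papers, 17 in the cluster, 9 in common.
print([round(p) for p in retrieval_percentages(10, 17, 9)])
# User 2: 31 pertinent papers, 43 in the cluster, 22 in common.
print([round(p) for p in retrieval_percentages(31, 43, 22)])
```

Current retrieval literature would usually call the first measure recall and the second precision; those names are not used in this thesis.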



One additional statistic may be of interest. This relates to
whether the documents that are pertinent to a search are added to the
cluster early or late in the process. For this purpose 50 clusters
from Sec. 9.33 and 9.41 were analyzed and the number of articles of
specified pertinence added in each quarter of the process was noted.
These figures were averaged for the 50 clusters. The results are
shown in Fig. 9.44. It will be seen that on the average almost half
(47%) of the pertinent articles which are included in the final
cluster are added during the first quarter of the process.

[Graph. Vertical axis: average percent of bibliography added per
quartile (scale 10 to 50). Horizontal axis: quartile of clustering
process (0-1/4, 1/4-1/2, 1/2-3/4, 3/4-1).]

Fig. 9.44. Graph showing average percent of bibliography
(or category) articles added during each
quartile of the clustering process.
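
The quartile statistic can be sketched for a single cluster as follows. This is an illustration only: it assumes that the order in which documents entered the cluster has been recorded as a list, and the function name is hypothetical. Averaging the resulting four figures over many clusters would give a graph of the kind shown in Fig. 9.44.

```python
def pertinent_by_quarter(order, pertinent):
    """Percent of the retrieved pertinent documents added in each quarter.

    order     - documents in the order they were added to the cluster
    pertinent - set of documents specified as pertinent
    """
    n = len(order)
    counts = [0, 0, 0, 0]
    for i, doc in enumerate(order):
        if doc in pertinent:
            # Position i of n falls in quarter 0, 1, 2 or 3.
            counts[4 * i // n] += 1
    total = sum(counts)
    if total == 0:
        return [0.0, 0.0, 0.0, 0.0]
    return [100.0 * c / total for c in counts]


# Toy cluster of eight documents, four of them pertinent.
order = ["a", "b", "c", "d", "e", "f", "g", "h"]
print(pertinent_by_quarter(order, {"a", "b", "e", "h"}))
```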






CHAPTER X 
CONCLUSIONS

In this chapter we shall make some initial comments concerning the 
adequacy of the various components of the experimental system. Then 
certain conclusions about the clustering procedure will be given. Next 
the effectiveness of the overall model and system in retrieving useful
sets of documents will be evaluated. In the final section some possible 
avenues for further research will be suggested. 

10.11 MAC Time-Sharing System

After five years' experience with batch processing computers, the 
author of this thesis found the MAC time-sharing system a refreshing 
change with some significant advantages. Let us briefly comment on the 
use of the MAC system in three areas: in debugging programs, in test- 
ing and evaluating systems, and in operational retrieval functions. 
DEBUGGING

It is estimated that the use of the MAC system cut by a factor of 
somewhere between two and ten the amount of time required to debug the 
experimental program. This, of course, is due to the fact that turn-
around time for a run with time-sharing is of the order of a few
minutes, whereas with batch processing it is usually several hours or 
days. 

The availability of more sophisticated debugging routines would
have reduced debugging time even further. Some features that would
have been of special help are multiple break points, conditional break
points, an interpretive mode, more convenient patching, automatic up-
dating of the English text, etc.

One problem in using time-sharing for debugging is that it is 
almost too easy to make changes to a program and re-run it. This 
results in one making a change before its consequences have been fully 
considered. Part of the answer to this problem lies in self discipline 
on the part of the programmer. It will also help when a computer be-
comes available on a 24-hour basis so one is not tempted to try to rush
through a change before a maintenance or test session. 

Two minor improvements to the consoles would help. A less noisy 
console would allow the user to more effectively contemplate a problem 
at the same time the computer is printing out some results on the con- 
sole. Also a neon light showing when the console is being serviced by 
the central processor would be of considerable value. 
SYSTEM TESTING 

After one has obtained a program that is debugged and performs 
according to specification, it often becomes apparent that the original 
specifications for the program need changing. This may result in some 
modifications to the program, or if the change is extensive, it may 
require rewriting the whole program. The same advantages and problems 
that time-sharing has in debugging are also in evidence in this cycle 
of program specification and respecification. 
OPERATIONAL RETRIEVAL 

Let us now consider what would happen if one were to decide to use
the MAC system or one like it as an operational information retrieval
system serving a community of real users.

If all of the IBM 1302 disc were used for data, a file 30 times the
size of the current T.I.P. file could be stored. This would allow one
to increase the time span covered by the periodical literature from 3
to perhaps 10-15 years and also add some non-periodical literature.
All of the files could also be completely inverted. There would 
probably still be room left for coverage of another discipline about 
the size of physics. If magnetic tapes were used, coverage could be 
increased even further by loading the disc with different data on 
different days of the week. 

Let us assume that the current limit of 30 users on line at once 
is maintained. The response time for simple requests for information 
would probably be acceptable to most users. This would be 1 second of 
computer time and 1-30 seconds of real time. The response time to 
more complex requests would probably be found objectionable to some 
users. Retrieval of a cluster, for example, might take 40-50 seconds
of computer time and 5-10 minutes of real time. 

The response time to complex requests could be improved by a 
factor of 5-10 if the supervisory system were modified to allow some 
type of direct access to the disc. The current supervisory program is 
designed for the storage of files that are constantly changing. This 
places a penalty factor of 5-10 on the accessing of files that never
change, such as those found in a library. 

One of the biggest difficulties with using the MAC system as an 
information retrieval service is that it has no provision for the trans- 
mission, display and reproduction of analog information. Such a 
capability would probably be needed, for example, if the system were to 
supply the abstracts or total text of articles.

Thus, with the current system a person with a console in his
office might be able to identify which articles are of interest, but
he would still have to go to the library to get them. (He could per-
haps have his own microfilm system, but this would be very expensive.)

10.12 T.I.P. Document Collection

The first tests of the clustering procedure were performed using
a single volume of the Physical Review. As the data base was increased,
some marked changes in the characteristics of the procedure were noted.
One of the major causes of these changes was the fact that the parti-
tioning sets for the single volume are all quite small, whereas the
partitions for the total T.I.P. file have a wide range of sizes.

The question arises as to whether an increase of perhaps one or 
two orders of magnitude in the current document file might further 
change the way the procedure operates. In an attempt to answer this 
question, let us first note that such an increase would necessarily 
involve coverage of some additional branches of science such as 
chemistry, mathematics and/or electrical engineering. This would be
true since a sizeable fraction of the significant physics periodical
literature that is being published is already being added to the T.I.P.
file. This implies that the size of the clusters generated by the 
procedure would not significantly change even if the size of the 
collection were greatly increased. 

Also the use of an inverted data storage system would keep the
access time to any one piece of information relatively constant even
if the size of the file were measurably increased. It is, therefore,
concluded that the system would operate in essentially the same way it
currently does even if the document file were scaled up in size by
several orders of magnitude.

10.13 Partitions 

The experimental results as summarized in Fig. 9.43 are evidence
of the fact that partitions based on citation information constitute a 
useful data base for the measure of relatedness and the clustering 
procedure. There were, of course, a few documents which were not in- 
cluded in the cluster to which it appeared they should belong. In 
almost all of these cases it was found that the documents had three or 
fewer citations which was evidently an insufficient number to properly 
place them in their appropriate cluster. 

From this, one might conclude that the clustering system as 
presently programmed may not be an effective retrieval tool for a file 
in which a large fraction of the documents have three or fewer cita- 
tions. Actually what may be needed in such a file is a modification in 
the type or types of partitioning information utilized so that parti- 
tions are also generated by users, title words, authors or some other 
parameter(s). A case where other types of partitionings would have
helped even in the citation-rich T.I.P. file was described in Sec. 9.52.

10.14 Storage Structure

One general conclusion that was reached in this project is that in
a dynamic system an attempt should be made to give the data a general
structure instead of a structure tailored to one specific requirement.
This will allow a flexible approach to new uses of the data. An in-
verted file structure coupled with the raw data file was suggested as a
possible general filing system.

It is argued in Sec. 7.22 that an inverted file should occupy
about the same amount of storage as is occupied by the file which is
being inverted. This claim was verified for the data in the T.I.P.
file.
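
A small sketch may show why the two files are of comparable size. It is not the file structure of Sec. 7.22: the raw file is assumed here to be a mapping from each document to the items it cites, and the names `invert` and `entry_count` are hypothetical. Each (document, cited item) pair of the raw file appears exactly once in the inverted file, so the number of entries is the same in both.

```python
from collections import defaultdict


def invert(raw):
    """Build an inverted file: cited item -> documents that cite it."""
    inverted = defaultdict(list)
    for doc, cited_items in raw.items():
        for item in cited_items:
            inverted[item].append(doc)
    return dict(inverted)


def entry_count(file):
    """Number of (key, value) pairs stored in a raw or inverted file."""
    return sum(len(values) for values in file.values())


# Toy raw data file: two documents and the items they cite.
raw = {"d1": ["r1", "r2"], "d2": ["r2"]}
inverted = invert(raw)
print(inverted)
print(entry_count(raw), entry_count(inverted))
```

The sketch counts entries only; the storage actually occupied would also depend on how keys and identifiers are coded.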

10.15 Retrieval Language

The fact that both the syntax and vocabulary of the retrieval
language are table-driven (i.e. they are specified by tables) was con-
sidered to be a significant advantage. As modifications in the
structure of the request and in the words used to describe the request
suggested themselves, they were easily incorporated into the system by
a minor modification in the appropriate table.
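
The idea of a table-driven language can be sketched in a few lines. The two tables below are invented for illustration and are far smaller than, and not taken from, the tables of the experimental system; the word classes and the function name `parse` are likewise assumptions. The point of the sketch is only that a new word or a new request form is admitted by adding an entry to a table, without changing the procedure.

```python
# Hypothetical vocabulary table: word -> word class.
VOCABULARY = {
    "get": "VERB",
    "print": "VERB",
    "articles": "NOUN",
    "titles": "NOUN",
    "by": "PREP",
    "of": "PREP",
}

# Hypothetical syntax table: the sequences of word classes accepted.
SYNTAX = {
    ("VERB", "NOUN"),
    ("VERB", "NOUN", "PREP", "ARG"),
}


def parse(request, vocabulary=VOCABULARY, syntax=SYNTAX):
    """Classify each word by table look-up; words not in the table are
    treated as arguments. Return the classes, or None if the sequence
    is not listed in the syntax table."""
    classes = tuple(vocabulary.get(w.lower(), "ARG") for w in request.split())
    return classes if classes in syntax else None


print(parse("get articles by Jones"))

# Adding a synonym requires only a change to the table.
extended = dict(VOCABULARY, fetch="VERB")
print(parse("fetch titles", vocabulary=extended))
```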

Currently no one besides the author of this thesis has had 
sufficient experience with the retrieval language to evaluate it. Let 
me, therefore, make some admittedly biased observations. 

First, the language was found to be easy to remember even after a 
lapse of several months in which it was not used. The language was also 
found to have considerable room for future growth. Indeed a large 
number of additional verbs and adjectives that would be useful in 
retrieval suggested themselves. The ability to make a request for 
information as complex or as simple as needed was also found helpful. 
Actually only a maximum of about three or four levels of structure has
been utilized so far.

10.2 Evaluation of Procedure

In this section we shall discuss whether the procedure as described 
in Chapter V has the general characteristics which it needs for opera- 
tion as a retrieval tool. An evaluation of the actual utility of the 
current procedure and experimental system in satisfying user requests 
will be discussed in the next section. 
CONVERGENCE 

Considerable difficulty was encountered with the earlier cluster- 
ing procedures because they occasionally entered into a non-terminating 
cycle. The steps taken to prevent such cycles have been described in 
Sec. 5.53. The experience gained over the past several months supports
the contention that the current procedure will always converge in a 
finite number of iterations to an answer cluster or to a comment that 
the request is inconsistent. 
GENERAL-SPECIFIC 

From Fig. 9.3 one can conclude that the use of a bias in the
correlation network does, indeed, allow one to increase or decrease the
size of the answer cluster. That the value to be given the bias can be
automatically determined by the composition of the request has been
experimentally verified by the results of Secs. 9.3-5.
AMBIGUITY RESOLUTION 

In Chapter IX examples are given showing how some of the possible 
answer clusters that satisfy a given request can be eliminated by 
specifying additional documents to be of interest or not of interest 
(additions to the Y and Z sets). It is clear that one can arrive at a 
point at which only one cluster satisfies the request by the appropriate 
additions to the Y and Z sets. From Fig. 9.7 one might conclude that
on the average at least two members of Z are required to make a request
unambiguous. Of course, even if the request is ambiguous, the desired 
answer cluster may still be found. For example, in Sec. 9.31 seven 
out of the ten requests with Y-(b.) resulted in A., and yet all seven 
are ambiguous. 
INCONSISTENCY RECOGNITION 

From the results of Fig. 9.5 we conclude that not only does the
procedure mark as inconsistent those requests for which there is no
answer cluster, but it also marks as inconsistent some requests for
which a valid answer cluster exists. This difficulty
is not considered serious, however, since the user can be coupled into
the system and can guide the procedure in the right direction and
reshape the request if an inconsistent situation is reached.

10.3 Evaluation of System 

In the last section several conclusions were stated concerning the 
characteristics of the clustering procedure. In this section we will 
discuss the more general problem of the effectiveness of the overall 
system as a retrieval tool. 

From Fig. 9.43 we note that the percent of pertinent documents
retrieved by clustering ranges from 64 to 90%. This compares favor-
ably with a published retrieval efficiency of about 50% for other
automatic retrieval systems.

Almost all of the pertinent documents which were not retrieved
were found to have three or fewer citations. This would give one the
hope that with an expanded data base for the partitions the 64-90%
retrieval efficiency could be improved even more.

We next note from Fig. 9.43 that from 47 to 86% of the retrieved
documents are not part of the set of documents of known pertinence.
Let us assume for a moment that all of these documents are irrelevant.
Many users would still find this acceptable since a quick examination
of the titles could be used to select the articles of interest from
the larger set.

Now let us consider whether or not some of the additional articles 
might really be found to be of interest by a user who has selected the 
cluster in which they are found. 

First, we observe that for the tests of Sec. 9.3 some of the
articles in the clusters were published after the October IEEE Proceed- 
ings came out and thus had no chance of being part of the bibliographies 
even if they were pertinent. This is the case, for example, with the 
following documents of Fig. 9.21: d^, e^, k^, k 12 , k^, k^, m^,..., 

m l8> "27 ' P 3' q 3' V and S* 

Also the authors of the three bibliographies used probably did not 

intend to exhaustively cover the area. They may have only selected 
what they considered to be the best reference(s) available for each 
specific concept or topic. 

These arguments do not hold for the articles added by the cluster-
ing procedure to the categories of Sec. 9.4. The categories are
supposedly exhaustive and should include all but the most recent
articles. In defense of the additional articles in the clusters let
us give two examples. The first title below is included in the
Physical Review category on "Luminescence" while the second is not.

1-133-1163
Optical properties of cubic SiC, luminescence of nitrogen-
exciton complexes, and interband absorption.

1-133-2023
Optical properties of 15R SiC, luminescence of nitrogen-
exciton complexes, and interband absorption.

As a second example, consider cluster A, of Sec. 9.42. This
cluster contains three articles that are classified in the category
"Erbium" in Physics Abstracts. Of the 31 other articles in the
cluster three contain the word "erbium" in their title and seven
more contain the word "erbium" in the abstract or text. All of the
remaining articles have at least one of the other 14 rare earth elements
mentioned in the title. The following is an example of an article
contained in the cluster A, but not included in the erbium category.

1-126-726
Energy levels and crystal-field calculations of Er3+ in
yttrium aluminum garnet.

For the tests with users described in Sec. 9.5 the percentage of
the cluster that is pertinent would be 27/59=46% for User 1 and
37/43=86% for User 2 if all of the articles of questionable (or
general) pertinence were counted. The user might even find some of
those articles judged non-pertinent to be of interest if he were 
allowed to examine the actual article instead of just the title. 

The foregoing arguments and data suggest that a user might, on the 
average, find at least half of the documents in a cluster of interest. 

It is perhaps significant that the percentage of pertinent docu-
ments retrieved is lower in the tests for the two categories than for
the other tests. The other tests involved bibliographies compiled by
experts (authors and users) while the categories were generated by
indexers.

One might also note that the tests of Sec. 9.3 have higher per-
centages of pertinent documents retrieved on the whole than do the
tests of Sec. 9.5. This could be explained by the fact that the users
of Sec. 9.5 based their decisions on the titles, authors, and citations 
of the articles, while the authors of Sec. 9.3 had undoubtedly read the 
articles they cited. The conclusion to be reached here is that the 
clustering procedure tends to do best in those tests where it was 
compared to sets generated by the careful consideration of experts. 

In conclusion, the experience of this thesis indicates that 
clustering may be a useful tool to research workers who desire informa- 
tion covering either a very specific or a very broad area of interest. 
It is our opinion that further development and research are both
warranted and essential. 

10.4 Suggestions for Further Research

The suggestions to be presented here have been divided into 
three general categories: 

(1) Data base and data structure 

(2) Clustering procedure and interaction language 

(3) Theoretical problems

10.41 Data Base and Structure
OTHER DATA BASES 

It has already been suggested (Sec. 10.13) that the clustering
system should be tested on other types of partition data. Some of the
other types of partitions that might be tried are listed in Sec. 6.22.

It is also suggested that tests be made of the simultaneous use of
several types of partitioning data. In this connection one might
consider the use of a weighting factor for the partitions which might,
for example, give a larger weight to partitions generated by citations
than to those generated by title words.
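
One way such a weighting factor might enter is sketched below. The sketch does not use the information-theoretic measure of relatedness defined earlier in this thesis; as a deliberately simple stand-in it counts the partitions that contain both documents and multiplies each count by a weight for its type. The data layout, the weights, and the function name are all hypothetical.

```python
def weighted_cooccurrence(doc_a, doc_b, partitions, weights):
    """Weighted count of partitions that contain both documents.

    partitions - mapping from partition type to a list of document sets
    weights    - mapping from partition type to its weighting factor
                 (types without an entry are given weight 1.0)
    """
    score = 0.0
    for kind, sets in partitions.items():
        shared = sum(1 for s in sets if doc_a in s and doc_b in s)
        score += weights.get(kind, 1.0) * shared
    return score


# Toy data: citation partitions weighted four times as heavily as
# title-word partitions (an arbitrary choice for illustration).
partitions = {
    "citation": [{"a", "b"}, {"a", "c"}],
    "title_word": [{"a", "b", "c"}],
}
weights = {"citation": 2.0, "title_word": 0.5}
print(weighted_cooccurrence("a", "b", partitions, weights))
print(weighted_cooccurrence("b", "c", partitions, weights))
```

Here documents a and b, which share a citation partition, score higher than b and c, which share only a title word.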

Of particular interest would be a system which utilized the type 
of usage data described in Chapters II and III. 
CHANGING FILE 

There are a number of questions relating to the fact that a document
collection is continually changing. What should happen when documents 
are added to or deleted from the file? Can the user be automatically 
notified of new documents of interest? In this connection one might 
want the user to permanently store those clusters found to be of 
interest. Then as new documents come into the file they can be com-
pared against the clusters. The user would then be notified of those
articles which were valid members of his clusters. 
CODING 

There is also need for additional work on the problem of data 
coding and compression. For example, one might be able to reduce 
storage requirements considerably by storing codes for all (or certain) 
authors' names in the raw data file. This may be true of the other 
types of data also. 

10.42 Procedure and Language

There are a number of directions in which the clustering procedure
and interaction language might be extended. One objective might be to
make a wider class of statements acceptable and understandable to the
system. This might involve increasing the vocabulary and/or allowing
other syntactic forms.
PARSING BY CONTEXT 

As a specific suggestion we note that the current system determines
the function of (parses) a word by a simple table look-up. A word
cannot have a dual function depending on its context. Thus if one wants
to use "p" as an abbreviation for print ("p. the titles of set 1"), this
would currently exclude its use, say, as an abbreviation for paper or as
the initial in an author's name ("get articles by 'P. A. Jones'" would,
however, be acceptable). It should be possible, however, to distinguish
between these different uses if one utilizes the context.
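
As a rough sketch of what parsing by context might look like, the function below assigns a role to the word "p" from its position and its neighbor. The three rules are invented for illustration and are much cruder than a real system would need; the function name and the role names are hypothetical.

```python
def role_of_p(words, i):
    """Guess the function of the word 'p' at position i from its context.

    Illustrative rules only:
      - first word of the request      -> abbreviation for "print"
      - followed by a capitalized word -> initial in an author's name
      - otherwise                      -> abbreviation for "paper"
    """
    if i == 0:
        return "print"
    following = words[i + 1] if i + 1 < len(words) else ""
    if following[:1].isupper():
        return "initial"
    return "paper"


print(role_of_p("p. the titles of set 1".split(), 0))
print(role_of_p("get articles by p. A. Jones".split(), 3))
print(role_of_p("get the p. on erbium".split(), 2))
```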
GRAPHIC DISPLAY 

A more radical extension of the language would be through the use 
of some type of graphical device. For example, it might prove useful to 
display part of the document network on an oscilloscope and to allow the 
user to specify the interesting and non-interesting documents by means 
of a light pen. 

In addition to increasing the flexibility of the language, one 
might also want to allow the specification of some other functions. Let 
us suggest some additional functions that the clustering procedure 
might appropriately perform. 
CLUSTER SIZE 

A user might want to limit the size of the answer cluster to some
specified range at the outset (e.g. "Get between 3 and 7 articles
related to Phys. Rev. v. 136 p. 1899."). This could be accomplished by
increasing or decreasing the bias enough so that the size of the answer
cluster fell within the specified range.
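
The adjustment of the bias could be sketched as a simple search. The sketch assumes that a larger bias yields a larger (or equal) answer cluster, which is a convention chosen here and may be the reverse of the one used in the correlation network; `cluster_size` stands for a hypothetical call to the clustering procedure that returns the size of the answer cluster for a given bias.

```python
def fit_bias(cluster_size, lo, hi, bias=0.0, step=1.0, max_iter=50):
    """Adjust the bias until lo <= cluster_size(bias) <= hi.

    Assumes cluster_size does not decrease as the bias increases.
    Returns the bias found, or None if no bias is found in max_iter tries.
    """
    last_direction = 0
    for _ in range(max_iter):
        size = cluster_size(bias)
        if lo <= size <= hi:
            return bias
        direction = 1 if size < lo else -1
        # Halve the step whenever the search overshoots the range.
        if last_direction and direction != last_direction:
            step /= 2
        bias += direction * step
        last_direction = direction
    return None


def toy_size(b):
    """Toy stand-in for the clustering procedure: size grows with bias."""
    return max(0, int(2 * b + 1))


print(fit_bias(toy_size, 3, 7))
print(fit_bias(toy_size, 3, 7, bias=10.0))
```

If the specified range falls between two attainable cluster sizes, no bias can satisfy it, and the sketch gives up after `max_iter` trials rather than looping indefinitely.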
DATA BASE 

It would also be of value to a user if he could specify the type of 
partitioning data to be used by the clustering procedure. Thus the 
command, "Get the articles related by authors and users to Phys. Rev. 
Letters v. 11 p. 6", would use the partitions generated by both authors 
and usage data to create the answer cluster. This control could be 
extended to select for the data base certain classes of partitions 
within a broad type. For example, a request of the type, "Get the 
articles related by M.I.T. faculty users to Phys. Letters v. 7 p. 14",
would allow the user to single out for use that type of partitioning 
which he thought would yield the best results. 
CLUSTERS OF AUTHORS, ETC. 

There is no real reason why clusters must be limited to sets of 
documents. It may be useful to generalize the system to allow clusters 
to be formed of other types of entities such as authors, locations, 
words, etc. It might be very helpful, for example, to be able to deter- 
mine the cluster of scientists that are working in a given field or area. 

10.43 Theoretical Problems
ANSWER CLUSTER DEFINITION 

Some modification to the definition of an answer cluster may be of 
value. For example, should a change be made to the requirement that all 
the documents specified as interesting be in the cluster? 
NOISE 

There will, of course, be cases where certain documents are
mistakenly included together in a set of interest. This may arise, for
example, from an incorrect judgement on the part of a user or perhaps
by a clerical slip. The effect of this type of noise on the system
should be investigated. Also suitable steps should be taken to maintain
the integrity of the data base through editing processes.
SELF-SUSTAINING RUTS

Consider an information retrieval system which is based on the 
data generated by its users. This might be one based on usage data or 
on citations. Is it possible in such a system for a self-reinforcing 
feedback loop to be created which cannot be altered? For example, if 
users are supplied documents on the basis of past use, this may create 
new partitions which only serve to reinforce the results of the old 
partitions. 
EVALUATION MEASURE 

The measure described in Chapter III was not suggested for use in 
rating the merit or value of documents. Its function was to group 
together documents that were mutually pertinent. If a suitable way 
could be devised for measuring the worth of documents, this would be of 
considerable aid to users. Perhaps this would take the form of some 
type of consensus of opinion of the previous users of the documents. 
TRAILS VS. SETS 

In the article already cited by V. Bush the model suggested for 
information retrieval was a trail leading from one pertinent document 
to the next. The model used in this research endeavor is the partition- 
ing of the file into two subsets. Actually both models have useful 
features. In some cases there is a definite pattern or trail which 
should be followed in consulting the documents related to a given 
subject. In other cases the order in which the documents should be 
examined is apparent from their publication data. In still other cases 
there is no particular order in which the documents need be consulted. 
Thus it would seem that one might want to include both the ideas of 
sets of documents and trails of documents in a more general information 
retrieval model. 
PREDICTIVE USAGE 

As additional information becomes available on the types of 
questions that are asked by users and the sets of documents that seem 
to satisfy them, it may be possible to design a system involving some 
form of prediction of what a user really wants when he asks a given 
question. This might even be extended to involve trends in document 
usage, so that future document use is extrapolated on the basis of 
past use. 
APPENDIX A 
MEASURES OF RELATEDNESS 

Some of the measures which have been proposed for use in information 
retrieval are tabulated below. Measures (1) to (6) were originally 
suggested in terms of frequency counts. Measures (7) and (8) were first 
proposed in terms of probabilities. For purposes of comparison we have 
attempted to express each measure in the table both in terms of 
probabilities and frequency counts. In the case of measure (5) this 
was not possible. 

The definitions for the symbols used in the table and the con- 
version formulae for going from probabilities to frequency counts and 
back again are found in Sec. 3.1. It was necessary to add superscripts 
to the frequency counts in the table to distinguish between some 
additional counts which appear in these measures. Thus N. , is the 
number of partitions in which the subset of interest contains document 
j but not i. 
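
As an illustration of how such additional counts could be tallied, the sketch below classifies every partition by whether its subset of interest contains document i, document j, both, or neither. The function name and the dictionary keys are chosen here for illustration only and are not the symbols used in the table.

```python
from typing import Dict, Iterable, Set


def pair_counts(subsets_of_interest: Iterable[Set[str]],
                i: str, j: str) -> Dict[str, int]:
    """Count partitions by which of documents i and j the subset of interest holds."""
    counts = {"both": 0, "i_only": 0, "j_only": 0, "neither": 0}
    for subset in subsets_of_interest:
        has_i, has_j = i in subset, j in subset
        if has_i and has_j:
            counts["both"] += 1
        elif has_i:
            counts["i_only"] += 1
        elif has_j:
            counts["j_only"] += 1   # contains j but not i
        else:
            counts["neither"] += 1
    return counts
```

The four counts always sum to the total number of partitions, which gives a simple check on any tabulation.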